GPU programming - project3 - final


FEM or Finite Element Methods are analytical techniques which have applications in a wide variety of fields like thermal and structural analysis,  ocean engineering,  biomechanics and fluid environments. Most widely used solvers for this method are CPU based and are really slow. This project aims at building a very basic implementation of this method with a major portion of the computations occuring on the GPU. 


An example of a 1D finite element formulation

What is the traditional approach?

The base for any FEM problem is a Partial Differential Equation (PDE) that needs to be solved. The solution for the PDE is obtained by discretizing the domain of the PDE into small finite elements which are individually subject to various constraints, solved and then assembled back into its global parent.

What is our test case in this project?

For this project, I chose to implement a simple FEM calculation of a 1D Non linear PDE.


 The non- linearity arises because of the u*du/dx component in the equation.  

F'(X0) * ( X* - X0 ) = - F(X0)



The boundary conditions specified for this case are

  •        f = -0.5*pi*cos(0.5*pi*x) + 2*sin(0.5*pi*x)*(1-cos(0.5*pi*x)/pi   (force varies with distance)
  •       u(0) = 0 (displacement at the left end = 0 )
  •       u'(1)=1
  •       p=1
  •       q=0

      The problem could be reduced to the form of a simple linear equation

      K*U =F

      Where K is a tridiagonal, symmetric stiffness matrix, U the deflection and
      F the force applied on the system.


      For each element, this works out to  

    Which is then assembled to get the global stiffness matrix.

What are the steps involved?

The following flow chart explains in a nutshell, the steps involved in  this application 

The solution is iterative in nature and requires a few iterations for the solution to converge.

Where does GPU come in the picture?

Most of what is being done today for these kind of calculations is totally serial in nature and requires a lot of processing time for situations with a really fine mesh or a large number of nodes (e.g stress calculations on a truck or a bridge). For such situations, the process could run for hours (sometimes days) to calculate the final set of values. Parallelizing a lot the computations could definitely boost performance and achieve

 Of the steps involved in the flowchart, the most number of computations happen at 3 - assembly and 5 -solution.  The goal in this project was to parallelize these two stages to rapidly assemble and solve the
the equations.    




 Existing implementations for solving the linear equations use Gauss elimination method for the solution.

Gauss elimination is extremely serial in nature involving sequential forward and back substitutions!. This prevents us from directly being able to move the code to a parallel implementation.

A suitable parallel solver is required for the solution phase to solve the tridiagonal matrix equation. Even though a good number of them have been developed for the MPI system, a lot of them cannot be implemented directly in CUDA due to its inefficient inter-thread communication system.

After trying out many possibilities, I finally settled on a JACOBI ITERATIVE SOLVER for solving the equations involved in the tridiagonal equation.

More details about this method could be found here and here

Why is Jacobi iteration best for us?


  • Input for each iteration is dependent only on the values from the previous iteration eliminating need for passing values between threads.
  • It is simple and easy to imlement
  • Jacobi solver requires the matrix aij to be diagonally dominent which luckily is the case for most 1d stiffness matrices.

Drawbacks of Jacobi:

  • Takes a lot more time to converge when compared to most other iterative solutions


  • Gauss Seidel Method - Possible if we are dealing with not more than a block since it requires usage of concurrent values in other threads
  • Cyclic reduction method - Possible but hard to implement




The application involves 3 kernels

  • 1 for the assembly
  • 2 for the Jacobi iterative solver (This could have easily been made to one if message passing was available)


Optimizations attempted:

  • Reduction of for loops in the kernels in every possible way.
  • combining all  data passed to kernels as float4 and float2.
  •  Attempts to replace "for" loops with "while" did not provide any performance increase when compared to calling the innerloop function twice (equivalent to loop unrolling).
  •  Loop unrolling decreased performance slightly. (here the loop is only twice. The benefit from loop unrolling was shadowed by the over head in duplication of code)
  • Usage of __mul24( ), __sinf( ) , __cosf( ) and other in built CUDA functions




Comparison of output from Serial and Parallel versions using Emulator...


Fig.a: Output from serial version for 10 elements (C++)



Fig.b: Output from parallel version for 10 elements (200 iterations on the Jacobi) (Emulator)

Though the solutions are close to each other, we observe that this difference in accuracy might not be acceptable for a lot of engineering applications.

Possible reasons for loss in accuracy:

  • Iterative solver- Accuracy depends on the number of iterations
  • Absence of double precision.


Comparison of output from Emulator Vs Debug (the card)

The solutions obtained from the emulator version worked out really close to the ones obtained in debug mode.


Output from emulator version

Output from Debug version

Time Comparison

The time obtaining the solution for this case was drastically reduced when the code ran on the device.

 a) Emulator version

Average time taken for assembly kernel = 1.623716 ms

Average time taken for solve kernel (300 iterations) = 820.028381 ms

click here to look at the Output image

b) Debug version

Average time taken for assembly kernel = 0.131169 ms (More than 12 times increase in speed)

Average time taken for solve kernel (300 iterations) = 4.830269 ms (more than 169 times increase in speed)

click here  to look at the Output image 




Problems with CUDA

What works on the  Emulator does not necessarily work on the card



We have witnessed one case where CUDA has provided a significant reduction in process time.

 Even though CUDA offers great potential for fast parallel computations, its usage in engineering applications is going to be limited due to the following reasons

  • No/Limited support for double precision. Applications that require high levels of accuracy in calculations (Fluid dynamics, stress analysis etc.) will have to compromise.
  • Limited sharing of data between threads on the fly limits usage to a very few number of parallel algorithms that support this system. Efforts in developing or finding these algorithms can delay the development process.

When these are overcome, CUDA will certainly make a huge difference in performance of such engineering applications.



The source code for this project could be found at the following link 


There are 2 versions in that .rar file

1) reassembled with nbody11  - submission version  - Is the one without the jacobi iteration implementation and works completely on the card

2)reassembled with nbody16 - non linear -  submission version  - Is the version that works fine on the

emulator but provides wrong values when run on the card.


 Hardware used:

 Intel Xeon 3.6 Ghz, 2.5GB RAM