# Parallel Programming Education

### Lewis and Clark College, Portland, OR-- Summer 2013. Kyle Barton, Sam Dodson, Danielle Fenske, Ben Whitehead, Professor Jens Mache

During our ten weeks of summer research, we studied three different parallel-capable programming languages in search of the language that is most conducive to teaching parallel concepts at the undergraduate level. We first chose to research a functional programming language, Haskell. Haskell is a purely functional, deterministic programming language. It has a static, strong, type system with automatic type inference. The upshot is that the output of its functions will always be the same regardless of whether it is running on one or multiple cores. This eliminates deadlock and race condition errors that can plague traditional approaches to and parallel programming. Additionally, the abstraction in the parallel functions in Haskell allows for the programs to run on a range of hardware.

We wrote a simple parallel Haskell assignment to be used in the classroom. We have submitted the entire teaching module to the National Science Foundation funded website, CSinParallel, and we hope to have it posted on the site soon.

As an introduction to the Haskell programming language, below are sequential and parallel implementations of the Quicksort algorithm. For the readers not familiar with this sorting algorithm, a description of it follows. The algorithm sorts a list of integers. An element from the list is chosen to be the pivot value. A sublist of all the elements less than the pivot is created, and recursively sorted. A sublist of all the elements greater than or equal to the pivot is also created, and recursively sorted. Then the two sorted sublists are appended.

##### Sequential Quicksort

The sequential version of Quicksort is seen as one of the best examples of the elegance of the Haskell programming language, because implementations of the Quicksort algorithm in other programming languages, such as C, C++ or Java, are usually much longer.

Starting at line 2, the code sequentially executes, going through each case until it reaches one that applies to the given situation. The first guard, on line 2, describes the case of a non-empty list. The last case, on line 5, checks for an empty list, in which case the empty list is returned, because it is already sorted. If the set is nonempty, our sorted array will be found by recursively calling sort on the smaller and larger lists.

The `where` keyword, on line 3, is placed directly after a statement that needs to use the definitions stated in the where block. A `where` block provides local definitions so that there is no danger of ambiguity or reusing names in other parts of the program. On line 3 the set of all `xs` in the set which are less than or equal to the pivot is sorted recursively until we reach the case of an empty set, and then we retrace our steps through the recursion and concatenate the smaller lists to the larger lists. We do the recursion and evaluation in this order because Haskell is a lazy language, and will not evaluate statements until it is absolutely necessary. Haskell will save a list of functions to evaluate and not until it reaches the end of the recursive loop will it finally go back and find the outcomes resulting from each pass.

```1 sort :: (Ord a) => [a] -> [a]
2 sort (x:xs) = smaller ++ x:larger
3   where smaller = sort [y | y <- xs, y <  x]
4     	  larger = sort [y | y <- xs, y >= x]
5 sort _ = []
```
##### Parallel Quicksort

With the addition of three functions the sequential Quicksort can be written in parallel. The main difference is that the code takes advantage of the `par` and `pseq` functions, which can be found in the `Control.Parallel` modules. The `par` function takes the argument to its left and begins working on it in parallel, while moving on to the right argument. The `pseq` function is almost the opposite of `par`. When we call `pseq`, the function requires that its left argument finish before it allows the right argument to execute. The force function simply prevents lazy evaluation, so that the left argument is evaluated eagerly.

```1  import Control.Parallel
2  parSort :: (Ord a) => [a] -> [a]
3  parSort (x:xs) =
force larger `par`
(force smaller `pseq`
(smaller ++ x:larger))
4    where smaller =
parSort [y | y <- xs, y <  x]
5          larger =
parSort [y | y <- xs, y >= x]
6  parSort _ = []
7  force :: [a] -> ()
8  force xs = go xs `pseq` ()
9  where go (_:xs) = go xs
10       go [] = 1
```

#### Riemann Integral

For demonstrating some of the features of Haskell, an instructor could show a simple program that will estimate pi by approximating the area under half of the unit circle and multiplying it by two. This has been a popular introduction to parallelism and has been used in the ACM Parallel Computing TechPack. Our code could be a nice addition to this previous educational work, which has focused on more traditional parallel programming languages.

These programs are good class examples because they utilize the list features of Haskell, they are easy to relate to, and the transition to parallel code is fairly straightforward.

##### Sequential Riemann

The program will take an argument for the number of partitions and return an estimation of pi. It will do this by the method of right-handed Riemann rectangle summation. To implement this sum we do the following. First we create a list that has an appropriate dx based on the number of partitions the user inputs. We then multiply dx by twice the height of the right hand point to get the area of our rectangle. We then add up all of the area of the rectangles to get our approximation.

```1 -- Calculate upper hemisphere
of the unit circle
2 circle :: Double -> Double
3 circle x = sqrt (abs(1 - x^2))
4 -- Calculate the area of
right Riemann rectangle
5 area :: Double -> Double -> Double
6 area x1 x2 = (x2 - x1) * circle x2
7 -- Recursively add the areas
of the Riemann rectangles
8 estimate (x:[]) = 0
9 estimate (x:y:xs) =
(area x y) + estimate (y:xs)
```
##### Parallel Riemann

The parallel version is almost identical code, using a similar recursive function to add the areas of the Riemann rectangles. The primary difference comes from the insertion of the `par` and `pseq` functions. In our parallel estimation of pi, `par` is calculating smaller, and `pseq` is calculating larger at the same time. Larger makes a recursive call to `parEstimate`, giving us another smaller section to begin executing in parallel. This essentially gives us a cascading sum of parallel computations of the areas of the Riemann rectangles. Once larger --with the recursive smallers-- is finally complete, larger and smaller are added together, Resulting in pi.

```1  import Control.Parallel
2  -- Calculate upper hemisphere
of the unit circle
3  circle :: Double -> Double
4  circle x = sqrt (abs(1 - x^2))
5  -- Calculate the area of
right Riemann rectangle
6  area :: Double -> Double -> Double
7  area x1 x2 = (x2 - x1) * circle x2
8  -- Recursively add the areas
of the Riemann rectangles
9  parEstimate :: [Double] -> Double
10 parEstimate (x:[]) = 0
11 parEstimate (x:y:[]) = area x y
12 parEstimate (x:y:xs) =
13 smaller `par`
(larger `pseq` smaller + larger)
14   where smaller = area x y
15         larger = parEstimate (y:xs)
```

## JavaScript/RiverTrail

After Haskell, we explored the possibilities of parallelizing web applications by using Intel's parallel programming extension to JavaScript, River Trail, to parallelize a flocking algorithm that uses emergent behavior, called boids. River Trail introduces a single new data type, the `ParallelArray`, and six higher order functions that interact with it. In the code above, we use the function map in order to draw each of our boids object in parallel, greatly improving graphical computational efficiency.

#### Reynolds' Boids

The boids simulation was first created by Craig Reynolds in 1986. It simulates the basic flocking motion of animals such as birds. There are three central rules which guide the boids' movements. Each rule outputs a vector in a certain direction, and after each rule is calculated, the vectors are summed and the boid is moved in the direction of the summed vector. Here are the three rules:

• Cohesion: All boids tend towards the average position of all other boids in a neighboring vicinity.
• Separation: Boids must steer to avoid collisions with other boids nearby.
• Alignment: Boids steer toward the average direction of the surrounding boids.

The program then adds up all the vectors that result from the rules and adjust the boid’s current position and velocity accordingly. This program can be parallelized because at each time step, we can calculate the rules for each boid independently of the others’. Instead of sequentially traversing the boids to calculate their new positions, we can calculate the rules for each boid in parallel. This can cause a significant speedup if done correctly, and students would be able to observe the effects of parallelization.

We wrote the entire simulation in JavaScript in a little over a week, and then proceeded to find ways to parallelize certain actions in the simulation. We were able to successfully parallelize the drawing of the boids.

The boids simulation was programmed in the JavaScript programming language. JavaScript is the scripting language most commonly used on the Internet. The syntax is similar to C and Java, so students should be able to quickly learn how to code in it. Its multiparadigm design allows the programmer to select a functional or object-oriented programming style. River Trail consists of one new data type, `ParallelArray`, and six higher-order functions (`combine`, `filter`, `map`, `reduce`, `scan`, and `scatter`) used to manipulate `ParallelArray`. These higher-order functions act as any functional programmer would expect. The functions are carried out in parallel, without the programmer having to specify low-level settings, like the number of threads or order of execution. This abstraction makes the language readily accessible to new parallel programmers. `ParallelArray`s are immutable, so in order to manipulate them, you must create new (or “freshly minted”) `ParallelArray`s based on the existing ones in order to change the data. Although they may be multidimensional, `ParallelArray`s must be rectangular (or uniform) – meaning each row has the same number of columns. Every `ParallelArray` has a shape which describes the dimensionality and size of the array. For example a 4x5 `ParallelArray` has the shape. To parallelize a program, the programmer must define a `ParallelArray` and the data it contains, the method that will determine the parallel processing pattern, and an elemental function that is applied to the data in parallel. The function we were able to parallelize was the draw function which is called every `step`. Instead of sequentially going through every boid to draw each according to their position, we were able to draw the boids in parallel. Here is the code we wrote:

```
var boids = new ParallelArray(boids);
var draw = boids.map(function draw(boid) { ... });
```
We create the `ParallelArray` of boids to manipulate, and then call the anonymous function draw on each boid that is executed in parallel. We also attempted to parallelize the `step` function which carries out the flocking algorithms. The `combine` function is well-suited for this, but due to time constraints and an error message stating “`combine is not a function`,” we were unable to implement this method.

River Trail is not the only JavaScript parallel API.We also explored web workers (another JavaScript parallel API), but we decided against using it because it is much more low-level, and it involves message passing. The programmer must write explicit commands and think at a lower level to facilitate communication between the web workers. Furthermore, each web worker must be in charge of its own JavaScript file, which would not be easy for the boids program. These are the main reasons why River Trail is a better option for beginning programmers. The choice between River Trail and web workers is a tradeoff between simplicity and flexibility.

There are many ways to improve this simulation, and numerous unexplored paths to follow. The boids simulation has been implemented with a variety of additions, including 3D effects, predator and prey models, and other biological phenomenons. All of these simulations could be further improved with parallelization. Due to certain time constraints, we were un able to analyze performance of our program and compare with the sequential version, but a performance comparison would be very valuable to this project as well. Furthermore, other parallel APIs do exist for JavaScript, such as web workers. It would be interesting to explore the similarities and differences between the two parallel APIs.

## CUDA-C

Finally, we looked into the CUDA-C programming language. CUDA is a parallel computing architecture that takes advantage of the processing power of Graphics processing units (GPUs). CUDA uses the central processing unit (CPU) and GPU to accelerate computing. Serial code is run on the CPU while parallel code is run on the GPU. We chose to parallelize the Needleman-Wunsch algorithm using CUDA.

Needleman-Wunsch is a bioinformatics algorithm for comparing DNA sequences using dynamic programming. Given two sequences of DNA, say GGAATTA and CGGAATT, we want to compare the sequences, inserting blanks when necessary, in order to obtain an optimal alignment. In this example, the best alignment is _GGAATTA and CGGAATT_. The algorithm finds this optimal alignment by creating a scoring matrix, which evaluates all possible partial alignments, and then performing a traceback in order to find the best one. We start by defining a scoring function f such that,

f(a, a) = 1
f(a, b) = 0
f(a, _) = f(_, b) = -1

Essentially, we want to increment our score if two identical letters align, keep our score the same if mismatch letters align, and then decrement our score if a blank is used. Using this function, we can see that the score in our optimal alignment above is 4.

Now, we line up the two sequences on vertical and horizontal axes, and then fill out our table, searching for an optimal alignment. The Needleman-Wunsch algorithm fills out each cell thusly:

T(i, j) = max{T(i-1, j) + f(a, _), T(i-1, j-1) + f(a, b), T(i, j-1) + f(_, b)}

Where i is the row, j is the column, a is the letter on the ith row and b is the letter on the jth column.

The challenge of parallelizing the process of completing the scoring matrix has been addressed in a number of ways. We implemented two of them. We first wrote a parallelized version with one of the more popular methods, which evaluates diagonal rows, from southwest to northeast, in parallel. Notice that each cell is computed by referencing its north, west, and northwest neighbors. By filling out the table diagonal by diagonal, we can ensure that each cell has the three necessary neighbors already completed when it comes time to evaluate it. Furthermore, each cell in a diagonal can be computed independently from one another, and therefore the table can be filled out in parallel. When a diagonal is being computed, each cell is assigned to a thread, and thus the computation of a diagonal should take constant time.

We also implemented a method that actually allows us to fill out the table row by row. We used a parallel prefixing algorithm to fill out an entire row in parallel, given the completion of the row above it. Parallel prefix is a method of calculating partial and total values for any group of comparable objects bound by a binary associative operator. The most common example involves calculating partial sums for a sequence of summands. At each step, processors communicate with each other and send their total values. If the total values are sent “left,” they adjust the recipients’ total values. If they are sent “right,” they adjust the recipients’ total and partial values. The example below uses the maximum function as the binary operator. X[i] stores the partial values, and max store the total values. Now, how do we use this method in order to calculate a row of our table in parallel? It takes four basic steps, all of which can be done in parallel. First, we need to calculate a couple of values for each cell:

Define w[j] = max{T(i-1, j) + f(a, _), T(i-1, j-1) + f(a, b)}. Notice that these are the north and northwest values that T(i, j) considers when evaluating. Since we have already computed the previous row, these w[j] values are known.

Next, define z[j] = w[j] + j. This seems somewhat arbitrary, but we will use this in the next couple of lines. Notice that z[j] values are also known.

Now, define x[j] = T(i, j) + j, which can be rewritten as x[j] = max{w[j], T(i, j-1) + f(_, b)} + j = max{w[j] + j, T(i, j-1) -1 + j} = max{z[j], T(i, j-1) + (j-1)} = max{z[j], x[j-1]}. Notice then, that an x[j] value is really the max of all previous x[j] values, as well as any z[j] values from previous columns. Looking back at our prefixing example above, we see that, starting with each cell’s z[j] value, and treating x[j] values as the partially evaluated terms, we can calculate every cell’s x[j] value in parallel, which finishes in 0(log(j)) time.

Finally, recall that x[j] = T(i, j) + j, so T(i, j) = x[j] - j. Since all of the x[j] values are now known, we can calculate the entire row of T(i, j) values in parallel.

Both the diagonal and the prefixing algorithms yielded impressive results in terms of runtime. We hoped that we could implement our prefix version in such a way that it could outperform the diagonal evaluation, but the diagonal implementation appeared to perform better, especially with a sequence of larger size.

#### Results

We ran simulations of the diagonal and prefix programs, varying the number of threads per block. We graphed average duration and performance on a GeForce GTX-460. A 1000 character sequence on 336 cores, diagonals outperformed prefix, contrary to other predictions. However, the margin of difference is small and possibly inconclusive. To find conclusive results, we would like to test a much larger sequence on a better graphics card. Optimization of our code would also be necessary. 