Parallel Programming EducationLewis and Clark College, Portland, OR-- Summer 2013. Kyle Barton, Sam Dodson, Danielle Fenske, Ben Whitehead, Professor Jens MacheHaskellDuring our ten weeks of summer research, we studied three different parallel-capable programming languages in search of the language that is most conducive to teaching parallel concepts at the undergraduate level. We first chose to research a functional programming language, Haskell. Haskell is a purely functional, deterministic programming language. It has a static, strong, type system with automatic type inference. The upshot is that the output of its functions will always be the same regardless of whether it is running on one or multiple cores. This eliminates deadlock and race condition errors that can plague traditional approaches to and parallel programming. Additionally, the abstraction in the parallel functions in Haskell allows for the programs to run on a range of hardware. We wrote a simple parallel Haskell assignment to be used in the classroom. We have submitted the entire teaching module to the National Science Foundation funded website, CSinParallel, and we hope to have it posted on the site soon. Quicksort in HaskellAs an introduction to the Haskell programming language, below are sequential and parallel implementations of the Quicksort algorithm. For the readers not familiar with this sorting algorithm, a description of it follows. The algorithm sorts a list of integers. An element from the list is chosen to be the pivot value. A sublist of all the elements less than the pivot is created, and recursively sorted. A sublist of all the elements greater than or equal to the pivot is also created, and recursively sorted. Then the two sorted sublists are appended. Sequential QuicksortThe sequential version of Quicksort is seen as one of the best examples of the elegance of the Haskell programming language, because implementations of the Quicksort algorithm in other programming languages, such as C, C++ or Java, are usually much longer. Starting at line 2, the code sequentially executes, going through each case until it reaches one that applies to the given situation. The first guard, on line 2, describes the case of a non-empty list. The last case, on line 5, checks for an empty list, in which case the empty list is returned, because it is already sorted. If the set is nonempty, our sorted array will be found by recursively calling sort on the smaller and larger lists. The 1 sort :: (Ord a) => [a] -> [a] 2 sort (x:xs) = smaller ++ x:larger 3 where smaller = sort [y | y <- xs, y < x] 4 larger = sort [y | y <- xs, y >= x] 5 sort _ = [] Parallel Quicksort With the addition of three functions the sequential Quicksort can be written in parallel. The main difference is that the code takes advantage of the 1 import Control.Parallel 2 parSort :: (Ord a) => [a] -> [a] 3 parSort (x:xs) = force larger `par` (force smaller `pseq` (smaller ++ x:larger)) 4 where smaller = parSort [y | y <- xs, y < x] 5 larger = parSort [y | y <- xs, y >= x] 6 parSort _ = [] 7 force :: [a] -> () 8 force xs = go xs `pseq` () 9 where go (_:xs) = go xs 10 go [] = 1 Riemann IntegralFor demonstrating some of the features of Haskell, an instructor could show a simple program that will estimate pi by approximating the area under half of the unit circle and multiplying it by two. This has been a popular introduction to parallelism and has been used in the ACM Parallel Computing TechPack. Our code could be a nice addition to this previous educational work, which has focused on more traditional parallel programming languages. These programs are good class examples because they utilize the list features of Haskell, they are easy to relate to, and the transition to parallel code is fairly straightforward. Sequential RiemannThe program will take an argument for the number of partitions and return an estimation of pi. It will do this by the method of right-handed Riemann rectangle summation. To implement this sum we do the following. First we create a list that has an appropriate dx based on the number of partitions the user inputs. We then multiply dx by twice the height of the right hand point to get the area of our rectangle. We then add up all of the area of the rectangles to get our approximation. 1 -- Calculate upper hemisphere of the unit circle 2 circle :: Double -> Double 3 circle x = sqrt (abs(1 - x^2)) 4 -- Calculate the area of right Riemann rectangle 5 area :: Double -> Double -> Double 6 area x1 x2 = (x2 - x1) * circle x2 7 -- Recursively add the areas of the Riemann rectangles 8 estimate (x:[]) = 0 9 estimate (x:y:xs) = (area x y) + estimate (y:xs) Parallel RiemannThe parallel version is almost identical code, using a similar recursive function to add the areas of the Riemann rectangles. The primary difference comes from the insertion of the 1 import Control.Parallel 2 -- Calculate upper hemisphere of the unit circle 3 circle :: Double -> Double 4 circle x = sqrt (abs(1 - x^2)) 5 -- Calculate the area of right Riemann rectangle 6 area :: Double -> Double -> Double 7 area x1 x2 = (x2 - x1) * circle x2 8 -- Recursively add the areas of the Riemann rectangles 9 parEstimate :: [Double] -> Double 10 parEstimate (x:[]) = 0 11 parEstimate (x:y:[]) = area x y 12 parEstimate (x:y:xs) = 13 smaller `par` (larger `pseq` smaller + larger) 14 where smaller = area x y 15 larger = parEstimate (y:xs) ResourcesJavaScript/RiverTrailAfter Haskell, we explored the possibilities of parallelizing web applications by using Intel's parallel programming extension to JavaScript, River Trail, to parallelize a flocking algorithm that uses emergent behavior, called boids. River Trail introduces a single new data type, the Reynolds' BoidsThe boids simulation was first created by Craig Reynolds in 1986. It simulates the basic flocking motion of animals such as birds. There are three central rules which guide the boids' movements. Each rule outputs a vector in a certain direction, and after each rule is calculated, the vectors are summed and the boid is moved in the direction of the summed vector. Here are the three rules:
The program then adds up all the vectors that result from the rules and adjust the boid’s current position and velocity accordingly. This program can be parallelized because at each time step, we can calculate the rules for each boid independently of the others’. Instead of sequentially traversing the boids to calculate their new positions, we can calculate the rules for each boid in parallel. This can cause a significant speedup if done correctly, and students would be able to observe the effects of parallelization. We wrote the entire simulation in JavaScript in a little over a week, and then proceeded to find ways to parallelize certain actions in the simulation. We were able to successfully parallelize the drawing of the boids. The boids simulation was programmed in the JavaScript programming language. JavaScript is the scripting language most commonly used on the Internet. The syntax is similar to C and Java, so students should be able to quickly learn how to code in it. Its multiparadigm design allows the programmer to select a functional or object-oriented programming style. River Trail consists of one new data type, var boids = new ParallelArray(boids); var draw = boids.map(function draw(boid) { ... });We create the ParallelArray of boids to manipulate, and then call the anonymous function draw on each boid that is executed in parallel. We also attempted to parallelize the step function which carries out the flocking algorithms. The combine function is well-suited for this, but due to time constraints and an error message stating “combine is not a function ,” we were unable to implement this method.
River Trail is not the only JavaScript parallel API.We also explored web workers (another JavaScript parallel API), but we decided against using it because it is much more low-level, and it involves message passing. The programmer must write explicit commands and think at a lower level to facilitate communication between the web workers. Furthermore, each web worker must be in charge of its own JavaScript file, which would not be easy for the boids program. These are the main reasons why River Trail is a better option for beginning programmers. The choice between River Trail and web workers is a tradeoff between simplicity and flexibility. ResourcesCUDA-CFinally, we looked into the CUDA-C programming language. CUDA is a parallel computing architecture that takes advantage of the processing power of Graphics processing units (GPUs). CUDA uses the central processing unit (CPU) and GPU to accelerate computing. Serial code is run on the CPU while parallel code is run on the GPU. We chose to parallelize the Needleman-Wunsch algorithm using CUDA. Needleman-Wunsch is a bioinformatics algorithm for comparing DNA sequences using dynamic programming. Given two sequences of DNA, say GGAATTA and CGGAATT, we want to compare the sequences, inserting blanks when necessary, in order to obtain an optimal alignment. In this example, the best alignment is _GGAATTA and CGGAATT_. The algorithm finds this optimal alignment by creating a scoring matrix, which evaluates all possible partial alignments, and then performing a traceback in order to find the best one. We start by defining a scoring function f such that,
f(a, b) = 0 f(a, _) = f(_, b) = -1 Essentially, we want to increment our score if two identical letters align, keep our score the same if mismatch letters align, and then decrement our score if a blank is used. Using this function, we can see that the score in our optimal alignment above is 4. Now, we line up the two sequences on vertical and horizontal axes, and then fill out our table, searching for an optimal alignment. The Needleman-Wunsch algorithm fills out each cell thusly: The challenge of parallelizing the process of completing the scoring matrix has been addressed in a number of ways. We implemented two of them. We first wrote a parallelized version with one of the more popular methods, which evaluates diagonal rows, from southwest to northeast, in parallel. Notice that each cell is computed by referencing its north, west, and northwest neighbors. By filling out the table diagonal by diagonal, we can ensure that each cell has the three necessary neighbors already completed when it comes time to evaluate it. Furthermore, each cell in a diagonal can be computed independently from one another, and therefore the table can be filled out in parallel. When a diagonal is being computed, each cell is assigned to a thread, and thus the computation of a diagonal should take constant time. We also implemented a method that actually allows us to fill out the table row by row. We used a parallel prefixing algorithm to fill out an entire row in parallel, given the completion of the row above it. Parallel prefix is a method of calculating partial and total values for any group of comparable objects bound by a binary associative operator. The most common example involves calculating partial sums for a sequence of summands. At each step, processors communicate with each other and send their total values. If the total values are sent “left,” they adjust the recipients’ total values. If they are sent “right,” they adjust the recipients’ total and partial values. The example below uses the maximum function as the binary operator. X[i] stores the partial values, and max store the total values. Now, how do we use this method in order to calculate a row of our table in parallel? It takes four basic steps, all of which can be done in parallel. First, we need to calculate a couple of values for each cell: Define w[j] = max{T(i-1, j) + f(a, _), T(i-1, j-1) + f(a, b)}. Notice that these are the north and northwest values that T(i, j) considers when evaluating. Since we have already computed the previous row, these w[j] values are known. Next, define z[j] = w[j] + j. This seems somewhat arbitrary, but we will use this in the next couple of lines. Notice that z[j] values are also known. Now, define x[j] = T(i, j) + j, which can be rewritten as x[j] = max{w[j], T(i, j-1) + f(_, b)} + j = max{w[j] + j, T(i, j-1) -1 + j} = max{z[j], T(i, j-1) + (j-1)} = max{z[j], x[j-1]}. Notice then, that an x[j] value is really the max of all previous x[j] values, as well as any z[j] values from previous columns. Looking back at our prefixing example above, we see that, starting with each cell’s z[j] value, and treating x[j] values as the partially evaluated terms, we can calculate every cell’s x[j] value in parallel, which finishes in 0(log(j)) time. Finally, recall that x[j] = T(i, j) + j, so T(i, j) = x[j] - j. Since all of the x[j] values are now known, we can calculate the entire row of T(i, j) values in parallel. Both the diagonal and the prefixing algorithms yielded impressive results in terms of runtime. We hoped that we could implement our prefix version in such a way that it could outperform the diagonal evaluation, but the diagonal implementation appeared to perform better, especially with a sequence of larger size. ResultsWe ran simulations of the diagonal and prefix programs, varying the number of threads per block. We graphed average duration and performance on a GeForce GTX-460. A 1000 character sequence on 336 cores, diagonals outperformed prefix, contrary to other predictions. However, the margin of difference is small and possibly inconclusive. To find conclusive results, we would like to test a much larger sequence on a better graphics card. Optimization of our code would also be necessary. Resources |