Final Report

Parallelizing a Kakuro Solver Over Different Parallelization Frameworks

Summary:

We implemented a Kakuro solver sequentially and in parallel using the CUDA, OpenMP, and OpenMPI frameworks. We ran our CUDA implementation on the GHC machines and our OpenMP and OpenMPI implementations on the Latedays machines. We compared the performance of our initial implementations across these frameworks and tried to devise algorithms better suited to each framework and its abstraction model. This allowed us to explore the complexity of solving an NP-complete constraint satisfaction problem over different parallelization frameworks, and to understand which parts of the problem could be effectively communicated to achieve speedup and which parts created bottlenecks for the different frameworks.


Background:

Kakuro is a Sudoku-esque puzzle where the objective is to fill a semi-complete grid with distinct numbers from 1-9, given the values that each row, column, and main diagonal should sum up to.

In order to accommodate larger board sizes for testing and comparisons, we modified this original game in the following way: the board can be filled with non-distinct numbers from 0-9, and each board is a complete NxN grid.

As demonstrated in the figure, the algorithm's inputs are an NxN grid filled with either values 0-9 or blank values represented by -1, a 1xN array of row sums (the required sum of each row in the grid), a 1xN array of column sums, and a 1x2 array of diagonal sums holding the top-left-to-bottom-right and top-right-to-bottom-left diagonal sums, respectively. The algorithm's output is the completed board, with every -1 replaced by a value 0-9, that satisfies all of the row, column, and diagonal sum constraints.
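
Concretely, one puzzle instance can be pictured as a structure along these lines (an illustrative sketch with hypothetical names, not our exact code):

    #include <vector>

    // Hypothetical representation of one puzzle instance (a sketch, not our exact code).
    struct KakuroInstance {
        int n;                                   // board dimension N
        std::vector<std::vector<int>> grid;      // N x N; values 0-9, or -1 for an unset cell
        std::vector<int> rowSums;                // length N: required sum of each row
        std::vector<int> colSums;                // length N: required sum of each column
        int diagSums[2];                         // [0]: top-left to bottom-right, [1]: top-right to bottom-left
    };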

The four input arrays make up the key data structures for all of the implementations, and each implementation adds a few of its own. For the sequential and OpenMP implementations, we maintain a list of range values (represented as tuples) for the unset cells in the grid, coupled with a list of row-column pairs giving the locations of those unset cells. For the CUDA implementation, we instead maintain an array of multiple NxN boards that are partially completed (i.e., that have some of the unset values filled in), which we traverse in order to backtrack and find the solution; alongside it, we maintain arrays recording the locations of the empty spaces in each board and the number of empty spaces per board. In OpenMPI, we use the same input structures as in our sequential and OpenMP implementations, but we organize the information as public (non-modifiable data such as the column sums, row sums, diagonal sums, and board size) or "private" (data made independently available to each process, such as the grid). Because the OpenMPI implementation uses brute-force search rather than our pruning algorithm, it does not need to declare the range and location structures used by the sequential and OpenMP implementations.

The operations on these data structures likewise depend on the implementation. The four input arrays are the only ones used the same way everywhere -- no implementation modifies the sum arrays, and all of them modify the input board by setting values for the spaces initially marked -1. For the sequential and OpenMP implementations, the range list, initially populated with the full range 0-9 for each unset cell, is repeatedly tightened as we iteratively prune the possible values a given unset space can take; the only modification to the row-column list is removing entries as we deterministically set values for originally unset spaces. For the CUDA implementation, we add a board to the array of boards whenever we set a value for an unset space (we create a new board with that value set and append it) and remove a board whenever we deem it invalid; when we do this, we also update the arrays tracking the number and locations of the remaining empty spaces. Finally, when we backtrack, we modify the values of previously unset spaces in the boards within the array. For OpenMPI, each process simply sets a value in its independent copy of the board and proceeds to try the next value; processes only share data once a result has been found.
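
For the sequential and OpenMP versions, the auxiliary structures described above can be pictured roughly as follows (hypothetical names, not our exact declarations):

    #include <map>
    #include <utility>
    #include <vector>

    // Illustrative auxiliary structures for the sequential/OpenMP versions
    // (a sketch of the description above, not our exact code).
    using Cell  = std::pair<int, int>;            // (row, col)
    using Range = std::pair<int, int>;            // inclusive [lo, hi] of values 0-9

    std::map<Cell, Range> ranges;                 // possible values per unset cell
    std::vector<Cell>     unsetCells;             // locations initially marked -1

    // Setting a cell deterministically removes it from unsetCells and triggers
    // another round of pruning on every other unset cell in its row, column,
    // and (if applicable) diagonal.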

In this problem, we search for a completed grid whose row, column, and diagonal values match the given sums. Our version of Kakuro is still a constraint satisfaction problem, where the constraints are the sums and the unknowns are the values of the empty spaces. To find a solution, we build it up by setting individual values and then continuing the search with the new information we have assumed. The solution space we need to explore is therefore vast, and this is where parallelism comes in: it lets us explore multiple parts of the solution space at the same time. Our approaches thus work toward intelligently exploring all the possible values, using parallelism to do different work simultaneously.

There are several ways to do this: checking different solution boards in parallel, or looking at independent spaces within a single board, keeping track of the information that changes every time a value changes, and propagating it forward. In terms of constraint satisfaction, this means that we find the solution in a series of "steps", where each step is simply a value that we assume and then try solutions for; we stop once we have found a solution by following the trail of valid assumptions. For example, looking at Fig. 1, the values at grid[1][0] and grid[1][2] must lie in [0,7] according to the required row sum, and in [0,2] and [0,9] respectively according to the column sums. For grid[1][1], the diagonal constraint pins the value to exactly 7. Once we recognize that this value is a singleton and set it in the board, we must propagate that information and update the ranges of all the other blank cells. This propagation of ranges together with the setting of a value is one step, and we repeat the process over and over again; it is also one of the smaller units of work that we try to parallelize.
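
A single pruning step for one blank cell then amounts to intersecting the ranges implied by its row and column sums (and diagonal sum, when the cell lies on one), roughly as sketched below (reusing the aliases from the previous sketch; the remaining-sum and blank-count arguments are assumed to be computed by the caller):

    #include <algorithm>

    // One pruning step for a single cell: intersect the ranges implied by the
    // row and column sums. A sketch, not our exact code; the diagonal is handled
    // the same way when the cell lies on one.
    //   remRow/remCol    = row/column sum minus the values already set there
    //   blanksRow/blanksCol = number of OTHER blank cells in that row/column
    Range constrainCell(int remRow, int blanksRow, int remCol, int blanksCol) {
        // The other blanks contribute at most 9 each, so this cell must cover the rest.
        int lo = std::max({0, remRow - 9 * blanksRow, remCol - 9 * blanksCol});
        int hi = std::min({9, remRow, remCol});
        return {lo, hi};   // lo == hi means the value is forced (a "singleton");
                           // lo > hi means the partial board is already invalid
    }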

Approach:

Sequential:

Our initial idea for the sequential implementation was a pure brute-force approach -- namely, for each unset cell in the Kakuro board, trying values 0-9 in every possible combination with the other unset cells, and accepting whenever a completely filled board satisfied all the row, column, and diagonal constraints. Since any unset cell could take on values 0-9 even when another cell in the same row, column, or diagonal shared that value, there are 10^(number of unset spaces) possible combinations to try. While this is fine for smaller boards with few unset spaces, the problem clearly becomes exponentially harder as the number of unset spaces increases (which it generally does as board size increases).
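
A bare-bones version of that brute force looks roughly like the following (a sketch; isValid is a hypothetical stand-in for our constraint check):

    #include <cstddef>
    #include <vector>

    // Pure brute force: try 0-9 in every unset cell and test complete boards.
    // O(10^k) for k unset cells -- a sketch of the idea, not our tuned code.
    bool solveBrute(KakuroInstance &p, std::size_t idx,
                    const std::vector<Cell> &blanks) {
        if (idx == blanks.size())
            return isValid(p);                 // do all the sums check out?
        auto [r, c] = blanks[idx];
        for (int v = 0; v <= 9; ++v) {
            p.grid[r][c] = v;
            if (solveBrute(p, idx + 1, blanks))
                return true;
        }
        p.grid[r][c] = -1;                     // undo and report failure
        return false;
    }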

As a result, we decided to add an initial pruning of values for the unset spaces -- based on the values already set in an unset cell's row, column, and diagonal, and the sum values for that row, column, and (possibly) diagonal, we could typically cut a cell's candidates from 10 values down to around 2-3. (For boards with very few blank spaces, solving the Kakuro board therefore becomes trivially easy, because many of the spaces have only one possible value. The ranges are generally wider for boards with more unset spaces, but the values are still reasonably constrained.)
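
To give a rough sense of the payoff (illustrative numbers, not a measurement): a board with 20 unset cells has 10^20 raw combinations, but if pruning cuts each cell to roughly 3 candidates, the space shrinks to about 3^20 ≈ 3.5 x 10^9, which is still exponential but many orders of magnitude smaller.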

After implementing this initial prune, we realized we could keep pruning the ranges further and further as we set values and constrained the ranges of other originally unset spaces, which is far cheaper than trying out values that cannot work and testing the board's validity each time. We therefore added another layer of pruning every time we were able to set a value or constrain a range further. This shrinks the solution space without having to send the board to our validation function for every candidate value just to learn whether that value is possible. In other words, whenever we try a particular value, our sequential algorithm immediately limits the ranges of the values possible in the other spaces. This final improvement made for a highly efficient Kakuro-solving algorithm, solving most reasonable boards in a matter of milliseconds.
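
Putting it together, the core of the sequential search looks roughly like this (a sketch; propagateAll, pickMostConstrained, and isValid are hypothetical stand-ins for our helpers, and the range and location lists are copied at each level for clarity):

    // Sketch of the pruned sequential search: prune to a fixed point, set any
    // forced values, then branch on one remaining cell and recurse.
    bool solve(KakuroInstance &p,
               std::map<Cell, Range> ranges,
               std::vector<Cell> unsetCells) {
        if (!propagateAll(p, ranges, unsetCells))   // shrink ranges; set singletons
            return false;                           // some cell's range became empty
        if (unsetCells.empty())
            return isValid(p);                      // fully filled and consistent
        Cell next = pickMostConstrained(ranges);    // cell with fewest candidates
        auto [lo, hi] = ranges[next];
        for (int v = lo; v <= hi; ++v) {
            auto tryRanges = ranges;
            tryRanges[next] = {v, v};               // assume value v for this cell
            KakuroInstance trial = p;
            if (solve(trial, tryRanges, unsetCells)) { p = trial; return true; }
        }
        return false;
    }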

We implemented our sequential version completely in C++, using standard C++ libraries, such as the set, map, utility, and string libraries. For our sequential version, we did not specifically target a machine to run it on; rather, we focused on the best ways to reduce the search space for the board, and to most efficiently compute the validity of a board, in order to find and output the right answer.


OpenMP:

Given that OpenMP is a shared address space model, we designed our approaches around the data sharing that our pruning algorithm would benefit from the most. Besides the obvious, unchanging sums of the rows, columns, and diagonals, we also shared the board to allow for more independent exploration. In our first approach, we simply identified the parts of our sequential code that were independently parallelizable and parallelized them: for example, updating the ranges for a board in parallel while pruning, checking validity, and making recursive calls to our pruning function. We tried this basic implementation over an increasing number of threads. We also tried sharing the board across every new trial, but that led to concurrency issues from multiple threads updating values in the same copy of the board.

Our solution to that problem was to make a new copy of the board every time we handed it to the pruning function (which limits the ranges based on new information). However, this was inefficient in both space and time because of the sheer amount of memory allocated across the number of candidate solutions tried for larger boards (e.g., 128 x 128). The copying and allocation add computation time that we believe is a large part of why the speedup was negligible. Here we realized that the biggest drain on our resources came from the repeated copying inside our deep recursive calls.
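
In outline, the copy-per-trial idea looks roughly like this (a simplified sketch; board, lo, hi, r, c, and pruneAndSolve are hypothetical stand-ins for our variables and pruning search):

    #include <atomic>

    // Try each candidate value for one cell in parallel; every iteration works on
    // its own copy of the board to avoid the shared-board races described above.
    std::atomic<bool> found{false};
    KakuroInstance solution;
    #pragma omp parallel for schedule(dynamic)
    for (int v = lo; v <= hi; ++v) {
        if (found.load()) continue;          // skip remaining work once a solution exists
        KakuroInstance trial = board;        // private copy: the memory cost noted above
        trial.grid[r][c] = v;
        if (pruneAndSolve(trial)) {          // recursive pruning search on the copy
            #pragma omp critical
            { if (!found.exchange(true)) solution = trial; }
        }
    }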

Our next approach used the same algorithm but tasked it with trying new values for independent spaces at the same time, and with updating the ranges for a given board concurrently. We expected this to give better work distribution among our threads and more control over the parallel structure of the program, letting us parallelize the recursion steps better. As promising as this sounds, the task distribution turned out to be unequal: threads stalled while waiting to combine information from the other tasks, so the approach did not contribute any actual speedup. Another drawback is that it forces parallelism even when the next steps are too small for parallelizing them to be worthwhile. The recursive nature of the problem makes the work distribution complex.
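
A sketch of one way to express this per-space parallelism, using OpenMP tasks (an assumption about the construct; independentCells, tryValuesFor, and board are hypothetical stand-ins):

    #include <vector>

    // Spawn one task per independent cell; each task tries its candidate values on
    // a private copy, and the results are merged afterwards. A sketch of the idea.
    std::vector<KakuroInstance> results;
    #pragma omp parallel
    #pragma omp single
    {
        for (Cell cell : independentCells) {
            #pragma omp task firstprivate(cell) shared(board, results)
            {
                KakuroInstance local = board;      // task-private copy of the board
                tryValuesFor(local, cell);         // prune and set values for this cell
                #pragma omp critical
                results.push_back(local);          // merged and reconciled afterwards
            }
        }
        #pragma omp taskwait                       // the stall point described above
    }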

This implementation uses C++ and the OpenMP framework. Like our sequential version, this uses the set, map, utility and string libraries too. Given the varied nature of our tasks, we used the Latedays cluster to test our programs - a CPU would be better suited to the parallelism in this case than a GPU.


CUDA:

We originally tried adapting our final sequential implementation, with pruning, into a CUDA implementation by partitioning the board into chunks and having each block attempt to solve the unset spaces within its chunk. Since the empty spaces in the board were always randomly chosen, no chunk was likely to have a large concentration of unset spaces, but the chunks that did have more of them were more complicated subproblems, because there were more combinations of values to try; this would result in uneven work distribution and waiting between blocks. We did attempt this solution, but quickly realized it would not produce the desired speedups. A large problem with this design is that no block can benefit from the results of the other sections: if the correct value, or a reduced range, is found for an unset space outside a given block, that block cannot know this information or use it for its own solution. Additionally, without looking at the board as a whole, it was too complicated to coordinate information once each block returned with its findings. If blocks came back with conflicting values (e.g., two blocks set values for two empty spaces in the same row, but the resulting row sum did not equal the given row sum), then resolving the conflicts or sharing updated information would incur overheads that the parallelism could not outweigh.

As a result, we attempted a strategy different from the sequential one, since the structure of the sequential version was not amenable to sensible work distribution. For our final CUDA implementation, we used a backtracking strategy. We maintain a list of candidate boards, generated by creating boards with each possible value set for the first unset space found, then creating boards based on the next unset space found (with the first already set), and so on, up to a certain number of filled spaces / iterations. At each step, different blocks are responsible for evaluating a portion of this list of boards and checking their validity. We also added initial pruning of these values to restrict the search space, plus another level of pruning based on the validity checks of the generated boards. In this way, the array of NxN boards maps to blocks and threads: each block, as stated above, gets a chunk of boards within the array, and each thread in a block can evaluate a different board. This first phase produces a list of possible boards, and different threads of the GPU then attempt to solve each of them; if a thread finds a valid board with all spaces set, the search terminates and the completed board is returned. This second phase is the backtracking step: if a board ever becomes invalid, or becomes filled without satisfying all of the constraints, we undo our last guess and try a different value in its place. If none of the values work, we backtrack further; if we backtrack all the way past the first empty space, we know there is no valid solution.
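
A heavily simplified sketch of the board-generation phase is shown below (firstEmpty and valueFits are hypothetical device helpers, and bounds checks on the output array are omitted); the backtracking kernel then assigns one candidate board per thread and searches it to completion:

    // Each thread expands one partially filled board from the current frontier:
    // it finds that board's first empty cell and emits one child board per value
    // that still satisfies the row/column/diagonal sum constraints.
    __global__ void generateBoards(const int *oldBoards, int numOld,
                                   int *newBoards, int *numNew, int n) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx >= numOld) return;

        const int *board = oldBoards + idx * n * n;
        int cell = firstEmpty(board, n);              // index of the first -1
        for (int v = 0; v <= 9; ++v) {
            if (!valueFits(board, n, cell, v))        // quick sum-based pruning
                continue;
            int out = atomicAdd(numNew, 1);           // claim a slot in the new frontier
            int *child = newBoards + out * n * n;
            for (int i = 0; i < n * n; ++i)           // copy the parent board...
                child[i] = board[i];
            child[cell] = v;                          // ...then set the assumed value
        }
    }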

We implemented this in C++ and CUDA, using the following C++ headers: cstdio, cstdlib, cmath, cstring, fstream, and algorithm, along with the cuda_runtime API. We targeted the GHC machines (NVIDIA GeForce GTX 1080 GPUs, CUDA compute capability 6.1). The foundational idea for our final algorithm came from an adaptation of a Sudoku solver in CUDA ("Parallelized Sudoku Solver on the GPU", listed in the references section). That code served as a starting point for our algorithm, but was heavily modified to account for the different constraints of Kakuro versus Sudoku (e.g., Kakuro has row, column, and diagonal sums that must be maintained) and for the far greater search space a Kakuro solver must handle (since Kakuro can have much larger board sizes, a slightly larger range of possible values per unset space, and non-distinct values in the same row, column, or diagonal).


OpenMPI:

Thinking about this approach was harder in terms of deciding what information we wanted to hand out to our processes and what we wanted to collect back from them. The first algorithm we tried was simply to hand each process a value and a location in the board, together with the board itself and all the sum values, and let it solve independently from there. The communication between processes is minimal because they perform independent work; the only time they communicate is when a solved board is found, so that the other processes can be terminated.
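
A stripped-down sketch of that distribution (assuming the board is flattened into an int array; loadBoard and solveFrom are hypothetical stand-ins for our input handling and brute-force search):

    #include <mpi.h>

    // Sketch: each rank assumes a different value for the first empty cell and
    // searches independently; whichever rank finds a solution reports it to rank 0.
    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int n = 8, board[64];                        // flattened N x N board
        int firstEmptyCell = loadBoard(board, n);    // hypothetical: read input, return first -1

        bool solved = false;
        for (int v = rank; v <= 9; v += size) {      // partition candidate values by rank
            board[firstEmptyCell] = v;
            if (solveFrom(board, n)) { solved = true; break; }
        }
        if (solved && rank != 0)
            MPI_Send(board, n * n, MPI_INT, 0, 0, MPI_COMM_WORLD);
        // Rank 0 either uses its own result or receives one board from a worker;
        // the remaining ranks are then signalled to stop searching.
        MPI_Finalize();
        return 0;
    }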

However, the drawback of this approach is that if the chosen location has fewer possible values than we have processes, we waste the resources available to us by leaving some of them idle. The challenge with this framework was to use a fixed number of available processes on a tree-recursive problem whose size grows exponentially at every level; the framework rewards a model that allows for maximum independent solving. Our next idea was to divide the board and share the independent space exploration as we did in our OpenMP implementation, but we realized the communication cost of sending large boards and then merging the results was too high to yield any speedup: the processes would communicate at every level, i.e., every time a value is tried for a given empty space. Another approach we would have liked to try is to 'queue' up boards as the processes modify them and add values to the spaces, and then have processes pick up a board from the queue and verify whether it works in its current state. The same problem arises here: the frequency of communication is high and the messages are large. For instance, a message would consist of a board with a value set, and a solved (or unsolved) board would be sent back to check whether it worked; this would be repeated over and over again, recursively, at every step between worker and master, contributing a significant communication and synchronization cost.

Results:

We measured and compared performance between implementations based on the wall-clock time to output a solution given the same input. We did not include initialization time in our measurements, because none of the implementations were initialization-heavy; it contributed only trivially to each implementation's overall time.
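
Concretely, each measurement wrapped only the solve call, roughly as follows (a sketch; solveBoard and instance are hypothetical names):

    #include <chrono>

    // Wall-clock timing of the solve step only; board parsing and other
    // initialization happen before the timer starts, as described above.
    auto start = std::chrono::steady_clock::now();
    bool ok = solveBoard(instance);                 // hypothetical solve call
    auto end = std::chrono::steady_clock::now();
    double seconds = std::chrono::duration<double>(end - start).count();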

We had varying input board sizes, but our main focus was on 8x8 and 16x16 boards. We found that the complexity of the problem varied more with the number of unset spaces in the input board than with the board size itself, so we mostly stuck to those board sizes. We did vary board sizes (8x8, 16x16, 32x32, 64x64) for our comparison of the sequential vs. OpenMP implementations; since those implementations were very similar, the larger boards yielded more interesting results than the smaller ones. Another reason for sticking with smaller board sizes is the CUDA and OpenMPI versions, where the large number of boards being passed around to achieve parallelization, the distribution of work among threads, and larger board sizes presented memory problems. We also put a soft limit on the number of blanks within a board of a given size, because past a certain threshold the problem became far too computationally complex (requiring a traversal of nearly the entire exponential search space) and none of the implementations performed well (all ran for more than 15 minutes). We generated our test input boards, based on the above considerations, with a Python script we wrote that randomly generates a board and its row, column, and diagonal sums given an input size. The script assigns -1s to spaces in the board with a probability that we specified, and that we adjusted to vary the difficulty of the boards.

We used the GHC and Latedays machines to run our code. For our parallel implementations, we set up the code to take arguments specifying the number of threads (for OpenMP), the number of blocks and threads (for CUDA), and the number of processes (for OpenMPI). CUDA was run on the GHC machines, while OpenMP and OpenMPI were run on the Latedays machines. We chose these machines based on the nature of the frameworks: OpenMP and OpenMPI are better suited to a CPU performing larger, more variable tasks, while CUDA is better suited to a GPU because of the smaller, bite-sized chunking allowed by its thread-block structure. Part of our project was to develop implementations that play to the strengths of the different frameworks, so given the machines readily available to us, we used the GPU (GHC machines) and CPU (Latedays machines) for the frameworks best suited to them.

In general, the complexity of our problem comes from the fact that its size increases exponentially at every level. Furthermore, because it is a sum-based problem rather than a uniqueness problem (like Sudoku), all of the contextual information has to line up, and the effects of setting a single value span more or less the entire board. This makes sharing information complicated and costly. We also cannot approximate the solution or find one probabilistically as we did in Assignments 3 and 4; this is not surprising given that, like Sudoku, the problem is NP-complete. Because we require an exact solution, we cannot add probabilistic approximations, and we must traverse an almost entirely exponential search space unless we get lucky. Our solutions are therefore hard to scale: with a larger input or problem size, the problem simply becomes bigger, and we found there was nothing we could really do to accommodate the growing number of boards. The only 'clever' thing to do is to limit the solution space through elimination, which itself requires computation and still leaves a (smaller) exponential space to traverse.

However, looking at NP-complete problems through a parallelization lens is interesting for thinking about the information required to reach the solution; the interconnectedness and dependency between the parts and the whole is what makes this problem complex.

OpenMP Results:

In the following graph, the sequential and OpenMP implementations are compared on four board sizes: 8x8, 16x16, 32x32, and 64x64. Both were run on the GHC machines (since Latedays was experiencing downtime and other issues) and computation times are compared in seconds; the baseline is the performance of the sequential code. The OpenMP times are for the thread count that minimized computation time (i.e., the lowest computation time over all thread counts tried).

The sequential implementation performs equivalently to the OpenMP implementation for the first three board sizes, and is faster for the 64x64 board. One caveat is that the OpenMP implementation was targeted at the Latedays machines and did in fact perform slightly better there when tested previously. Even so, the OpenMP implementation did not demonstrate any significant speedup over the sequential version. The graph below gives some insight into why this is the case.

In general, we observe a slowdown as the number of threads running the OpenMP implementation increases. This implies that we are not distributing the workload efficiently among threads, and that most threads spend their time waiting rather than doing significant work. Without distributing the work so that most threads contribute substantially, we incur the overhead of parallelization while gaining only minor speedups from adding threads; since the overhead outweighs any gains, there is no net speedup over the sequential implementation. This is partly speculation, but the graph above does seem to indicate that it is happening, at least in part. One reason is that there is a great deal of dependence between the threads (tasks) that is not easily separated out: pruning and setting a previously unset value affects what the other unset spaces can take on as values, and without knowing how other spaces have been constrained or what they have been set to, the board cannot be solved efficiently, since the combination of all this information is what leads to the solution. The parts that can be separated out without problems (e.g., pruning an individual space based on the range values of other unset spaces) are easily parallelized, but are so small in the overall computation that they have little effect on the final speedup. The parts that do need coordination require a layer of communication, which further limits the achievable speedup (the updated range values and the list of currently unset cells needed to be merged periodically).

CUDA Results:

In the following graph, the sequential and CUDA implementations are compared on four boards of size 16x16 with different numbers of blank values (i.e., different problem complexities). Both were run on the GHC machines and computation times are compared in seconds; the baseline is again the performance of the sequential code. The CUDA times are for the combination of block count and threads per block that minimized computation time (i.e., the lowest computation time over all configurations tried).


Clearly, the CUDA implementation is significantly slower than the sequential implementation for all test cases. A large reason is that, in order to parallelize the computation of a board's validity and the search for a solution, many boards with different values set for previously unset cells must be stored and shared among all the blocks. This effect shows up in the fact that, even with the number of blanks held constant, boards larger than 16x16 mostly ran for exceedingly long periods before finding a solution (exact values are not shown here because most runs exceeded 5 minutes and were terminated).

The number of boards generated was constrained by the size of the boards array and by the number of iterations of the iterative board-generating function. While that mitigated the memory problem, it meant larger boards became far more computationally complex to solve: since the board-generation step was limited, the backtracking step had to do more of the work of filling the remaining blanks, and had to backtrack further when no valid solution was found. Although the backtracking was also done in parallel, every block had to explore a far larger search space (compared to a smaller board input) independently, which meant more total work. We could not determine the exact split of time between the two kernel functions (board generation vs. backtracking), since the program generally terminated before completion, but a strong indicator of this theory is that the board-generation step almost always completed in under two minutes while the backtracking function ran far longer.

So, although significant portions of the code were computed in parallel, the program's memory bound meant the search space could not be constrained enough. Having found that narrowing the search space is crucial to solving Kakuro boards efficiently, this was a major weakness of the CUDA implementation. These reasons largely account for the gap between the sequential and CUDA implementations, especially considering that the sequential implementation only ever needed to store a few boards at once (a before and an after board), kept most of its extra information in a list of ranges for the unset spaces, and could maintain one central point of updated information, which let it prune very effectively and greatly reduce the search space at each step.

The above figure demonstrates the issue just mentioned: the need to share the large array of boards in order to parallelize computation. While there is a decent gain in going from one to two blocks, especially when there are few threads per block, performance becomes far worse at four blocks (and eight, and so on). An explanation is that when the array of boards is split into more than two chunks, with one chunk per block, the work distribution becomes quite skewed. The next set of boards generated in the first kernel (the board-generating function) is computed by finding the first unset space in each board and creating child boards for all the valid values it could take, so between the first block and the last block there is a large difference in the complexity of the boards left to be solved. With only two blocks this makes little difference, since one chunk is only marginally more complex than the other, and splitting the work across two blocks is still faster than computing with one. With four blocks the difference becomes increasingly apparent, and we most likely end up with one block dominating the computation time until it finds the right solution.

That said, we do see clear parallelism between threads: the computation time drops sharply when the first few threads are added, and continues to decrease as more are added, although the improvement becomes smaller with each additional thread as the parallel efficiency drops.

One thing to note is that this analysis is partly speculation based on the graphs shown above, but it is also partly the result of a deep dive into the CUDA code, looking at exactly how the boards are generated, how the backtracking is implemented, and why the way they are done can adversely affect some blocks and threads, as described above.

OpenMPI Results:

Our OpenMPI implementation is far slower than our other versions. We ran it on our 8x8, 16x16, and 32x32 test boards and found that even the 8x8 board was much slower. One obvious reason is the size and frequency of communication between processes. This was expected, because we decided to use plain backtracking instead of the pruning algorithm in order to minimize the frequency and amount of information communicated between processes (e.g., the updated range list and the unassigned indices); the trade-off is a lot of useless work, which turns out to be worse for speedup than the communication would have been. Pure brute-force backtracking makes us recursively examine spaces that the pruning algorithm would have skipped, simply because it is not intelligent enough to reduce the exploration space using the information gathered by other processes' exploration. This approach also scales poorly, because a fixed number of processes each explore down a single starting path: the larger the space gets, the deeper the recursion becomes, and the longer the recursive calls' dependencies and wait times grow.

That said, the OpenMPI implementation has value because it shed light on an important aspect of this problem: independent space exploration is the way forward. We think that examining spaces in parallel and solving independent subtrees of the recursion is valuable, if we can get it to switch effectively between large and small task spaces. The strength of this implementation lies in minimizing the size and frequency of the messages between processes; if we could strike that balance properly, i.e., share useful information every couple of steps instead of at every step, this design could balance communication against independent task-solving.







References:

Sequential:

OpenMP:

https://cse.buffalo.edu/faculty/miller/Courses/CSE633/Sankar-Spring-2014-CSE633.pdf

CUDA:

OpenMPI:

  • Guide to Solving Sudoku in Parallel:

https://cse.buffalo.edu/faculty/miller/Courses/CSE633/Shreekant-Spring-2014-CSE633.pdf

  • OpenMPI tutorial’s sample starter code for Hello World

http://mpitutorial.com/tutorials/



Work Distribution:

Anika:

  • 80% of sequential implementation of a Kakuro-solver
  • CUDA implementation of a Kakuro-solver
  • 40% of OpenMP implementation of a Kakuro-solver
  • 50% of the project proposal, checkpoint, and final report documents
  • Code to generate reasonable input boards and row, column, and diagonal sums

Naviya:

  • 20% of sequential implementation of a Kakuro-solver
  • OpenMPI implementation of a Kakuro-solver
  • 60% of OpenMP implementation of a Kakuro-solver
  • 50% of the project proposal, checkpoint, and final report documents


Given the list of work performed by each of us, as outlined above, we believe that the total credit for the project should be distributed 50-50 between ourselves.