Naviya Singla and Anika Sindhwani
We will implement a Kakuro solver sequentially and in parallel using the PyCUDA, CUDA, OpenMPI, and OpenMP frameworks. Then, we will compare the performance of those implementations and try to isolate the cases in which one implementation works better than the others, to examine which frameworks lead to the best results (fastest speedups). The last step of this project will be to explore and explain the performance of a more restricted version of the game, and to see how and why these restrictions affect performance.
Kakuro is a Sudoku-esque puzzle where the objective is to fill the blank grid with distinct numbers from 1-9 such that the numbers in each diagonal, block, row, or column add up to the sum-clue given for it. For example, a clue of 16 over two cells can only be satisfied by 7 and 9.
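To make the sum-clue constraint concrete, the following C++ sketch enumerates the digit sets that can satisfy a clue, assuming the standard rule of distinct digits 1-9 within a run (the name combosFor is a hypothetical helper, not part of any planned interface):

#include <cstdio>
#include <vector>

// Enumerate every set of distinct digits 1-9 of a given size whose sum
// matches a clue. Each set is encoded as a 9-bit mask (bit i => digit i+1).
std::vector<int> combosFor(int clue, int cells) {
    std::vector<int> out;
    for (int mask = 0; mask < (1 << 9); mask++) {
        int sum = 0, count = 0;
        for (int d = 1; d <= 9; d++)
            if (mask & (1 << (d - 1))) { sum += d; count++; }
        if (sum == clue && count == cells) out.push_back(mask);
    }
    return out;
}

int main() {
    // A clue of 16 over two cells admits only {7, 9}.
    for (int mask : combosFor(16, 2)) {
        for (int d = 1; d <= 9; d++)
            if (mask & (1 << (d - 1))) printf("%d ", d);
        printf("\n");
    }
    return 0;
}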
While that is the standard game, there are variations that might help us improve performance. For example, we could allow repeating numbers along a block/diagonal/row/column, restrict the cells to a fixed set of numbers, or use a different operation instead of addition. Changing certain “rules” or “restrictions” of the game in this way might provide insights into how to improve and analyze the performance of our implementations and algorithms.
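As a rough illustration of how such restrictions could be encoded, here is a minimal C++ sketch; the Rules struct and runValid helper are hypothetical names used only for illustration:

#include <vector>

// Hypothetical rule flags for the restricted variants described above.
struct Rules {
    bool allowRepeats; // permit repeated digits within one run
    int  maxDigit;     // shrink the digit alphabet, e.g. 1-5 instead of 1-9
};

// Check one completed run of cells against its sum clue under a rule set.
bool runValid(const std::vector<int>& run, int clue, const Rules& rules) {
    int sum = 0;
    bool seen[10] = {false};
    for (int d : run) {
        if (d < 1 || d > rules.maxDigit) return false;   // digit out of range
        if (!rules.allowRepeats && seen[d]) return false; // repeat forbidden
        seen[d] = true;
        sum += d;
    }
    return sum == clue;
}

int main() {
    // Under the standard rules {no repeats, digits 1-9},
    // the run {7, 9} satisfies a clue of 16.
    Rules standard{false, 9};
    return runValid({7, 9}, 16, standard) ? 0 : 1;
}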
One way that parallelization could help improve the runtime of this problem is by solving the board in 3x3 chunks at a time (there would be 9 of these in total for a 9x9 Kakuro board). CUDA could be especially useful here: the board maps naturally onto different blocks (each block being a 3x3 chunk), with threads computing the numbers within that block. One complication is that, because we would be partitioning the board, we would have to try out multiple different sum values for each block’s rows/columns/diagonals, and also account for repeating numbers in the same row/column/diagonal. Since we are also considering a more restricted version of the game, the solver would be much easier to parallelize when we do not have to account for this. A minimal CUDA sketch of this decomposition appears below.
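The sketch shows only the decomposition, not a full solver: one thread block per 3x3 chunk of the board, one thread per cell, with each thread initializing a candidate bitmask for its cell. The names (initCandidates, board, candidates) are illustrative assumptions, not a committed design.

#include <cstdio>

// One thread block per 3x3 chunk of a 9x9 board, one thread per cell.
__global__ void initCandidates(const int* board, unsigned short* candidates) {
    int chunkRow = blockIdx.x / 3, chunkCol = blockIdx.x % 3;
    int row = chunkRow * 3 + threadIdx.y;
    int col = chunkCol * 3 + threadIdx.x;
    int idx = row * 9 + col;
    // A filled cell keeps only its own digit; an empty cell starts with
    // all nine digits as candidates (bit i => digit i+1).
    candidates[idx] = board[idx] ? (1u << (board[idx] - 1)) : 0x1FFu;
}

int main() {
    int board[81] = {0}; // 0 marks an empty cell
    int* dBoard; unsigned short* dCand;
    cudaMalloc(&dBoard, sizeof(board));
    cudaMalloc(&dCand, 81 * sizeof(unsigned short));
    cudaMemcpy(dBoard, board, sizeof(board), cudaMemcpyHostToDevice);
    initCandidates<<<9, dim3(3, 3)>>>(dBoard, dCand); // 9 blocks of 3x3 threads
    cudaDeviceSynchronize();
    cudaFree(dBoard); cudaFree(dCand);
    return 0;
}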
Given that solving this puzzle requires information sharing across all dimensions (horizontal, vertical, block-wise, and diagonal), it is interesting to think of algorithmic ways of solving it faster, as well as to use it as a way to compare implementation frameworks. Our main challenge will be to identify the implementation-specific optimizations we would have to make for each framework and to reason about the best programming model abstractions for each method. Finally, tying all our observations together to explain the best-performing implementation will have to factor in several things, which will be interesting to think about and relevant to the discourse about choosing a framework for parallel puzzle solvers.
Our final goal is to come up with a set of results/benchmarks showing which implementations/optimizations make the Kakuro solver perform better, and to be able to justify those observations.
For the first week, our goal is to do a meta-analysis of Kakuro solvers, set up our sequential reference implementation (which focuses primarily on correctness), and try out the algorithmic optimizations that we find. Once we fix that, the next week will be spent on the CUDA and PyCUDA implementations, focusing on correctness but also on the CUDA-specific optimizations we can make to improve our performance. Finally, we will implement the OpenMP and OpenMPI versions.
The first deliverable is this proposal report; the second is our checkpoint report, which will comprise the code for the sequential, CUDA, and PyCUDA implementations along with our observations and our analysis of those observations. The final deliverable is the report with all of our work, observations, and the reasoning over our observations. We will also present our work as a poster during the final class presentation on Dec 16.
We will use some of the starter code from Assignments 2, 3, and 4 to set up the project and the libraries. In addition, we may do a meta-analysis of existing Kakuro solvers in search of algorithmic optimizations that could make our implementations faster.
In this project, we compare several frameworks: OpenMP, OpenMPI, PyCUDA, and CUDA. As in Assignment 2, we will use the GHC machines at ghc.andrew.cmu.edu for all our implementations. For all implementations except PyCUDA, we use C or C++, depending on the language of the above-mentioned starter code. For PyCUDA, we use Python 2, also on the ghc.andrew.cmu.edu machines.
Nov 1 - PROPOSAL
Nov 1-7 - meta-analysis, sequential implementation running
Nov 8-15 - CUDA/PyCUDA running
Nov 16-23 - prep for checkpoint report, OpenMPI implementation running
Nov 19 - CHECKPOINT DUE
Nov 23-30 - OpenMP implementation running, reasoning over results
Dec 1-7 - running experiments with restrictions on the problem
Dec 7-14 - clean up/document and create report
Dec 15 - FINAL REPORT DUE