In one to two paragraphs, summarize the work that you have completed so far.
So far, we have completed the sequential version of Kakuro and most of the OpenMP version. The OpenMP version is functional, but it does not yet show significant speedup over the sequential version, so we are still weighing the benefit of adding more parallelism against the overhead of making critical variables atomic. At the moment the parallelism is very basic: we parallelize over for loops that check a particular board state, for example checking rowSums and colSums in parallel across rows and columns, and keeping thread-local copies of the board inside a loop that checks one candidate value per cell in parallel. The basic structure of our code can be thought of as a repeated prune over the range of possible values for a cell: as we fix the values of cells, we narrow the range of values possible for other cells, and so on. A minimal sketch of the parallel sum check is given below.
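To make the current (admittedly basic) approach concrete, here is a minimal sketch of the parallel sum check. The board layout and the rowTarget/colTarget names are assumptions for illustration, not our exact code:

    #include <omp.h>
    #include <vector>

    // Hypothetical layout: board[r][c] holds a value 0-9, and rowTarget[r] /
    // colTarget[c] hold the clue sums. Returns true iff every row and column
    // hits its target; each line's check is independent, so the loops are
    // simple parallel-for candidates with a logical-AND reduction.
    bool sumsSatisfied(const std::vector<std::vector<int>> &board,
                       const std::vector<int> &rowTarget,
                       const std::vector<int> &colTarget) {
        int n = (int)board.size();
        bool ok = true;

        #pragma omp parallel for reduction(&& : ok)
        for (int r = 0; r < n; r++) {
            int sum = 0;
            for (int c = 0; c < n; c++) sum += board[r][c];
            ok = ok && (sum == rowTarget[r]);
        }

        #pragma omp parallel for reduction(&& : ok)
        for (int c = 0; c < n; c++) {
            int sum = 0;
            for (int r = 0; r < n; r++) sum += board[r][c];
            ok = ok && (sum == colTarget[c]);
        }
        return ok;
    }

One likely reason the speedup is limited so far is that each iteration of these loops does only O(n) work, so the parallel regions are fairly fine-grained relative to their scheduling overhead.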
In terms of our Kakuro solver, we made the following modifications to the original pen-and-paper game: 1. We transformed the game board into a full n x n board (with no blocked-off, value-less squares, unlike in the original game) so that it is easy to pass to the solver and doesn't require much manipulation of the board itself, since we felt that was tangential to the study of parallelizing a Kakuro solver; 2. Each cell in the board can take on values 0-9; 3. The values 0-9 can be repeated within a given row/column/diagonal, which is especially important because the sequential version is trivially fast on boards of size 9x9 and smaller. We may consider removing or modifying these restrictions to compare sequential versions and analyze how different constraints affect the parallel speedups we can achieve, but this is now a stretch goal. A tiny example instance under these rules is shown below.
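For concreteness, a hypothetical solved 3x3 instance under these modified rules might look like the following (the array names are just for illustration); note the full grid with no blocked squares and the repeated values within a line:

    // A hypothetical solved 3x3 instance: every cell holds a value 0-9,
    // no squares are blocked off, and values may repeat within a line.
    int board[3][3] = { {5, 4, 3},
                        {2, 2, 3},      // 2 repeats within this row
                        {9, 0, 6} };
    int rowTarget[3] = {12,  7, 15};    // 5+4+3, 2+2+3, 9+0+6
    int colTarget[3] = {16,  6, 12};    // 5+2+9, 4+2+0, 3+3+6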
Describe how you are doing with respect to the goals and deliverables stated in your proposal. Do you still believe you will be able to produce all your deliverables? If not, why? What about the "nice to haves"? In your checkpoint writeup we want a new list of goals that you plan to hit for the poster session.
Based on the schedule we had set for ourselves, we should have completed our sequential and CUDA implementations by now. Given that we are done with our sequential version and an OpenMP version, we think we are roughly on track. We would like our OpenMP version to be faster, which may set us back a little, but we have ideas for further parallelizing it that we intend to implement as soon as possible in order to stay on track.
In terms of deliverables, we are dropping the PyCUDA portion because it would not offer much insight beyond the fact that Python is slower than C-like languages; instead, we want to focus on our other implementations and think about ways to come up with interesting parallelism (in line with the proposal feedback we received).
We pruned parts of our original proposal once we started implementing our Kakuro-solver, because we realized that some of it seemed infeasible or not useful. The parts we removed or demoted to stretch goals are:
1. Having different sequential versions of the solver (with different solver/board constraints) and writing parallel versions of each of these sequential versions/constraint sets in the parallel programming interfaces we are examining.
2. Examining both PyCUDA and CUDA -- for this, we would need an additional sequential implementation in Python to make the initial comparisons, which seems tangential to the comparison of different parallelization interfaces/strategies, and would mostly just be a comparison of Python vs. C++, which wouldn't be a novel or interesting finding.
What do you plan to show at the poster session? Will it be a demo? Will it be a graph?
We plan to show plots of:
Do you have preliminary results at this time? If so, it would be great to included them in your checkpoint write-up.
Average sequential Kakuro-solver times:
Average OpenMP Kakuro-solver times:
List the issues that concern you the most. Are there any remaining unknowns (things you simply don't know how to solve, or resources you don't know how to get) or is it just a matter of coding and doing the work? If you do not wish to put this information on a public web site you are welcome to email the staff directly.
While working on the OpenMP solver, we have had trouble reasoning about the tradeoff between allocating thread-local copies of state and parallelizing over the shared board state. Our speedups aren't great yet, which poses a problem, but we think the challenge here is to fine-tune the parallelism and weigh the cost of making operations atomic against the benefit of the added parallelism. The sketch below illustrates the local-copy side of this tradeoff.
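As a concrete (hypothetical) illustration, the value check for a single cell can be written with per-thread board copies; checkBoard here is a stand-in for the full consistency test, not our actual code:

    #include <omp.h>
    #include <vector>

    using Board = std::vector<std::vector<int>>;

    // Stand-in for the full row/column consistency test (details omitted).
    static bool checkBoard(const Board &) { return true; }

    // Each candidate value for cell (row, col) is tried on a private copy of
    // the board, so cell writes need no synchronization; only the shared
    // result vector is guarded. The alternative is one shared board with
    // atomic/critical updates: less copying, but more contention.
    static std::vector<int> feasibleValues(const Board &board, int row, int col) {
        std::vector<int> feasible;
        #pragma omp parallel for
        for (int v = 0; v <= 9; v++) {
            Board local = board;        // per-thread copy: O(n^2) memory per value
            local[row][col] = v;
            if (checkBoard(local)) {
                #pragma omp critical
                feasible.push_back(v);
            }
        }
        return feasible;
    }

With only ten candidate values per cell, this loop is also quite fine-grained, which is part of what we need to tune.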
Our biggest challenge right now is to come up with interesting ways to parallelize the solver, and in particular to find an intuitive way to parallelize it with CUDA. For CUDA, we are thinking of dividing the board into blocks (as we did in Assignment 2) and finding the candidate ranges for cells locally within each block. We would then share that information across blocks to fix some values, continue solving the same blocks to get updated cell ranges, and repeat until all values are fixed. While we think this could work, the implementation is turning out to be more of a challenge than we expected, but we are working on it. It was also surprising to see no significant speedup with OpenMP, and we hope to change that within the next few days with better tuning of our parallelism.
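To make the tiled CUDA idea above concrete, here is a very rough kernel sketch (all names, the TILE size, and the candidate-bitmask representation are assumptions, and the real pruning rules are elided): each thread block owns one tile, each thread narrows the candidate set of one cell, and the host relaunches the kernel until no candidate set changes.

    #define TILE 16

    // candMask[r*n + c] is a bitmask of the values 0-9 still possible for cell
    // (r, c); rowTarget / colTarget hold the clue sums. Each thread prunes one
    // cell and sets *changed so the host knows another pass is needed.
    __global__ void pruneTile(int *candMask, const int *rowTarget,
                              const int *colTarget, int n, int *changed) {
        int r = blockIdx.y * TILE + threadIdx.y;
        int c = blockIdx.x * TILE + threadIdx.x;
        if (r >= n || c >= n) return;

        int old = candMask[r * n + c];
        int pruned = old;
        // Weakest possible example rule: a single value cannot exceed its row
        // or column target. The real solver would subtract already-fixed cells
        // in row r and column c, using information shared across blocks
        // between kernel launches.
        for (int v = 0; v <= 9; v++) {
            if (v > rowTarget[r] || v > colTarget[c])
                pruned &= ~(1 << v);
        }
        if (pruned != old) {
            candMask[r * n + c] = pruned;
            atomicExch(changed, 1);       // request another host-side iteration
        }
    }

The host side would launch pruneTile over a grid of ceil(n/TILE) x ceil(n/TILE) blocks in a loop, copying the changed flag back after each launch, and stop once it stays 0 (or once every cell's mask has a single bit set).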