Milestone report

Project Schedule

Completed So Far

So far, we have completed two main tasks.

First, we found a serial implementation of Random Forest and developed a full understanding of its code. Since the goal of the final project is to design a parallel solution, implementing the serial version ourselves would add little value and cost us time, so we decided to build our parallel algorithm on top of the existing serial implementation.

Second, we have made progress on parallel implementations of both the basic GPU version and the CPU version. For the basic GPU version, our idea is simply to have each GPU thread construct one decision tree, without yet considering workload or data balance. We have made some progress and expect to finish it in the first half of next week. For the CPU version, we have developed two basic parallel algorithms using Pthreads and OpenMP, both of which yielded significant performance improvements. Details and performance numbers can be found in the Preliminary Results section below.
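The basic GPU idea above, one thread per tree, can be sketched roughly as the following CUDA kernel. The `Dataset` layout, `build_tree_device` routine, and output buffering are hypothetical placeholders, not our actual implementation:

```cuda
// Sketch only: one GPU thread builds one decision tree.
// Dataset layout and build_tree_device() are hypothetical placeholders.
struct Dataset {
    const float* features;   // num_samples x num_features, row-major
    const int*   labels;
    int num_samples;
    int num_features;
};

__device__ void build_tree_device(int tree_id, Dataset data, int max_depth,
                                  unsigned long long seed, int* tree_nodes) {
    // ... bootstrap-sample the data, grow the tree level by level ...
}

__global__ void build_forest(Dataset data, int num_trees, int max_depth,
                             unsigned long long seed, int* all_tree_nodes,
                             int nodes_per_tree) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= num_trees) return;   // no load balancing yet, as noted above
    build_tree_device(t, data, max_depth, seed + t,
                      all_tree_nodes + t * nodes_per_tree);
}
```

One known weakness of this mapping, which we plan to address later, is that trees take very different amounts of work, so threads in a warp will diverge and finish at very different times.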

Deliverables and Goals

Compared with our previous goals, we decided not to implement the serial version ourselves and instead developed the parallel version based on the existing serial implementation. We also plan to implement optimized OpenMP and Pthreads versions for the CPU. It would be nice to implement an OpenMPI version for data parallelism as well. We believe we are on schedule and can produce all deliverables.

Our updated goals will be:

  1. Implement parallel random forest on GPU using CUDA.

  2. Implement parallel random forest on CPU using OpenMP and Pthreads.

Nice to have:

  1. Implement data-parallel version of the random forest using OpenMPI.

  2. High speedup compared to the serial version.

  3. Comparable accuracy on test data set.

Poster Session

For the poster session, we plan to show graphs of the speedup of our parallel algorithm on both CPU and GPU. We will further compare the speedups of the different algorithms we adopted during development.

Preliminary Results

Here are the preliminary results from the CPU parallel versions. The OpenMP version parallelizes the work across decision trees; the Pthreads version parallelizes the work within a single decision tree, across nodes at the same level during a level-order traversal. We evaluated performance on the MNIST dataset with 100 decision trees and max_depth = 10; the execution times and error rates are listed below. Currently, the OpenMP version that parallelizes over decision trees achieves the best speedup, at around 6.4x.

(Table: preliminary results — execution times and error rates for each version)

Issues That Concern Us Most

Currently, we have the following matters to address:

  1. A lot of work is needed to convert the CPU serial code into GPU parallel code, and this currently takes a large share of our time.

  2. Reducing data-loading time is critical for the GPU parallel version. Ideally, threads in the same block would process neighboring data. However, in the Random Forest algorithm, the data each node needs depends on a) how its parent node splits and b) random sampling, which makes maintaining that layout very difficult.

  3. We may need to survey existing research to find better parallel strategies for optimizing our performance, which may take a significant amount of time.