Songhao Jia (songhaoj) and Xinghao Yu (xinghaoy)
Link to the project's home page: https://sites.google.com/andrew.cmu.edu/15618-final-proj
We are going to implement a workload-balanced and communication-efficient version of the Random Forest (RF) algorithm. We will first implement a serial version of the RF algorithm. Based on that, we will implement both CPU and GPU parallel versions. Finally, we will compare and analyze the three versions in terms of speedup, scalability, and so on.
Random forest has been proven to be an important, high-performance machine learning algorithm. The idea behind it is to build many shallow decision trees on datasets sampled by bootstrapping and to ensemble these trees for prediction. Compared with a single decision tree, a random forest has been shown to overfit less and perform better on real-world data. The overall algorithm can be summarized as follows:
build_random_forest(dataset, num_trees, max_tree_height):
    all_sampled_datasets = bootstrapping(dataset, num_trees)
    return [build_tree(mini_dataset, max_tree_height) for mini_dataset in all_sampled_datasets]

build_tree(mini_dataset, max_tree_height):
    if mini_dataset is empty or max_tree_height == 0:
        return None
    cur_node.split = find_split(mini_dataset)
    cur_node.left = build_tree(samples in mini_dataset where cur_node.split is false, max_tree_height - 1)
    cur_node.right = build_tree(samples in mini_dataset where cur_node.split is true, max_tree_height - 1)
    return cur_node
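To make the bootstrapping step concrete, a minimal C++ sketch might look as follows (the function name and interface are our illustration, not a committed design):

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Draw one bootstrap sample: n row indices sampled uniformly with
// replacement. The rows selected here form the mini dataset used to
// grow one decision tree.
std::vector<std::size_t> bootstrap_indices(std::size_t n, std::mt19937 &rng) {
    std::uniform_int_distribution<std::size_t> pick(0, n - 1);
    std::vector<std::size_t> sample(n);
    for (auto &idx : sample) idx = pick(rng);
    return sample;
}
```

Sampling indices rather than copying rows keeps each bootstrap cheap; a tree then reads its rows through these indices.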
Implementing a parallel version of the Random Forest algorithm involves the following challenges:
Data locality. As the pseudo-code above shows, the mini dataset for each decision tree is drawn randomly from the original dataset by bootstrapping, so the data touched by each computation unit (a core in the CPU parallel version, a block in the GPU parallel version) is scattered across memory. This low locality can cause a lot of time to be spent on data transfer. During implementation, we need to arrange tasks carefully to identify and exploit both spatial and temporal locality.
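One way we might improve spatial locality (an assumption on our part, not a finalized design) is to gather each tree's bootstrapped rows into a contiguous buffer once, so that every later pass over the mini dataset becomes a sequential scan:

```cpp
#include <cstddef>
#include <vector>

// Copy the sampled rows of a row-major feature matrix into one
// contiguous buffer, trading an upfront copy for sequential access
// in every subsequent split-finding pass.
std::vector<float> gather_rows(const std::vector<float> &data,
                               std::size_t num_features,
                               const std::vector<std::size_t> &row_indices) {
    std::vector<float> packed;
    packed.reserve(row_indices.size() * num_features);
    for (std::size_t row : row_indices)
        for (std::size_t f = 0; f < num_features; ++f)
            packed.push_back(data[row * num_features + f]);
    return packed;
}
```

Whether the copy pays off depends on how many passes each tree makes over its mini dataset; we plan to measure this.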
Fully utilizing parallel computation resources. According to our survey and analysis, two high-level strategies can be adopted to construct the forest. The first is a depth-first algorithm: the whole GPU is used to compute the optimal split point for a single node of a decision tree. The second is a breadth-first algorithm, which builds the same-level nodes of different decision trees simultaneously. Each method has its own drawback. In the early stage, the total number of branches is small while the cost of finding each split is high, so the breadth-first algorithm cannot saturate the parallel computing resources. On the other hand, as the trees grow deeper, the depth-first algorithm wastes computing resources due to the large overhead of launching a GPU kernel for each small node. We have to devise another solution to better utilize the parallel computation resources.
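To quantify this trade-off: if no branch stops early, the breadth-first frontier at depth d holds num_trees * 2^d nodes, so the first levels expose little parallelism while deep levels expose plenty. A toy helper (purely illustrative, not part of the planned implementation):

```cpp
#include <cstddef>

// Worst-case number of simultaneously buildable nodes at a given depth,
// assuming no branch stops early: each of the num_trees trees contributes
// 2^depth nodes at that level.
std::size_t tasks_at_depth(std::size_t num_trees, std::size_t depth) {
    return num_trees << depth;
}
```

For example, 100 trees expose only 100 tasks at the root level but thousands a few levels down, which is why a hybrid of the two strategies looks attractive.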
Workload balance. Based on the analysis above, no matter which algorithm we adopt, it is hard to guarantee a balanced workload across computation units. The workload of each unit depends on a) the part of the dataset currently being processed, b) the current structure of the decision tree, and c) whether the tree stops growing early (reaching max_height, all data in the current node belonging to the same label, etc.).
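One mitigation we may try (a sketch under our own assumptions, not a committed design) is dynamic task assignment: idle workers pull the next node-building task from a shared counter instead of receiving a fixed static partition, so a worker that finishes a cheap subtree immediately takes more work rather than idling.

```cpp
#include <atomic>
#include <cstddef>

// A minimal lock-free task pool: workers call take() to claim the next
// task index; once the pool is drained, take() keeps returning `total`
// as a sentinel so workers know to stop.
struct TaskPool {
    std::atomic<std::size_t> next{0};
    std::size_t total;
    explicit TaskPool(std::size_t n) : total(n) {}
    std::size_t take() {
        std::size_t i = next.fetch_add(1);
        return i < total ? i : total;
    }
};
```

The same idea maps to the GPU side as a persistent-kernel work queue, though the synchronization details differ.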
We will need machines with a multi-core CPU and a GPU to implement the CPU and GPU parallel versions of random forests. The GHC machines would be perfect for our project. We will implement our serial-version random forest with reference to open-source code such as scikit-learn and GitHub repositories like [1] and [2], and implement the parallel versions with the help of papers [3] and [4].
[1] https://github.com/bjoern-andres/random-forest
[2] https://github.com/handspeaker/RandomForests
[3] A CUDA Implementation of Random Forests - Early Results, Håkan Grahn, Niklas Lavesson, Mikael Hellborg Lapajne, and Daniel Slat, School of Computing, Blekinge Institute of Technology
[4] Learning Random Forests on the GPU, Yisheng Liao, Alex Rubinsteyn, Russell Power, Jinyang Li, Department of Computer Science, New York University
Goals we plan to achieve:
1. Implement a serial version of random forest.
2. Implement a CUDA-based random forest using the GPU.
3. Implement an OpenMPI-based random forest using the CPU.
4. Analyze the speedup and accuracy of the different parallel approaches.
Goals we hope to achieve:
1. High speedup of the CUDA and OpenMPI parallelizations compared to the serial version.
2. High accuracy of the trained model evaluated on a test dataset.
Deliverables:
1. A graph comparing the speedup of different parallelization strategies.
2. A graph comparing the accuracy of models trained using different parallelization strategies.
3. In our poster, we plan to answer the following questions: Which parallelization approach is most suitable for the random forest model? On what types of datasets/workloads does the model achieve the best performance and accuracy?
GHC machines, C++
Finish the serial-version random forest – November 22nd
Finish the GPU-version random forest – November 29th
Finish the OpenMPI data-parallel random forest – December 6th
Finish the performance analysis – December 9th