Team members: Kun Liu (kunl2), Yucheng Hu (yhu3)
We will implement a parallel k-d tree and K-nearest neighbor (KNN) algorithm.
The k-d tree is a binary tree in which every node is a k-dimensional point. Every non-leaf node can be thought of as implicitly generating a splitting hyperplane that divides the space into two parts, known as half-spaces. Points to the left of this hyperplane are represented by the left subtree of that node and points to the right of the hyperplane are represented by the right subtree [2].
Below is an example of building a k-d tree for 2D data points (left); the splits form an uneven grid over the plane (right). After building the k-d tree, we can find the k-nearest neighbors of a given query point (not necessarily a point used to build the tree) by traversing the tree.
When building a k-d tree, at each split we need to find the point whose coordinate along the chosen dimension is the median of all points in the current region. Finding this median repeatedly is expensive when there are many points. A minimal sequential build is sketched below.
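To make the build step concrete, here is a minimal C++ sketch of a sequential build (the names Point, Node, buildKdTree, and the constant DIM are our own illustration, not a fixed API). It relies on std::nth_element, which places the median element in position in expected linear time without fully sorting the region:

#include <algorithm>
#include <array>
#include <vector>

constexpr int DIM = 2;                  // dimensionality of the points
using Point = std::array<double, DIM>;

struct Node {
    Point point;                        // median point stored at this node
    Node* left  = nullptr;
    Node* right = nullptr;
};

// Build a subtree over pts[lo, hi), cycling the split axis with depth.
Node* buildKdTree(std::vector<Point>& pts, int lo, int hi, int depth) {
    if (lo >= hi) return nullptr;
    int axis = depth % DIM;
    int mid  = lo + (hi - lo) / 2;
    // After this call, pts[mid] is the median along `axis`; points before
    // it are <= and points after it are >= along that axis.
    std::nth_element(pts.begin() + lo, pts.begin() + mid, pts.begin() + hi,
                     [axis](const Point& a, const Point& b) {
                         return a[axis] < b[axis];
                     });
    Node* node  = new Node{pts[mid]};
    node->left  = buildKdTree(pts, lo, mid, depth + 1);
    node->right = buildKdTree(pts, mid + 1, hi, depth + 1);
    return node;
}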
Several parallelization schemes are possible. First, the partitioning alternates among the k dimensions, and each split yields two independent half-spaces: can we parallelize across these independent subproblems? Second, can the search for the median along each dimension itself be parallelized? One possible task-based scheme is sketched below.
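As a concrete sketch of the first idea (assuming we adopt OpenMP tasks; buildKdTreeParallel and the cutoff constant are hypothetical names and tuning values, building on the sequential sketch above), the two half-spaces produced by a split are disjoint, so the recursive subtree builds can run as independent tasks:

// Parallel build: spawn the left subtree as a task and build the right
// subtree on the current thread; fall back to the sequential build for
// small ranges so that task overhead does not dominate.
Node* buildKdTreeParallel(std::vector<Point>& pts, int lo, int hi, int depth) {
    if (lo >= hi) return nullptr;
    int axis = depth % DIM;
    int mid  = lo + (hi - lo) / 2;
    std::nth_element(pts.begin() + lo, pts.begin() + mid, pts.begin() + hi,
                     [axis](const Point& a, const Point& b) {
                         return a[axis] < b[axis];
                     });
    Node* node = new Node{pts[mid]};
    const int cutoff = 1024;            // hypothetical tuning parameter
    if (hi - lo > cutoff) {
        #pragma omp task shared(pts)
        node->left = buildKdTreeParallel(pts, lo, mid, depth + 1);
        node->right = buildKdTreeParallel(pts, mid + 1, hi, depth + 1);
        #pragma omp taskwait
    } else {
        node->left  = buildKdTree(pts, lo, mid, depth + 1);
        node->right = buildKdTree(pts, mid + 1, hi, depth + 1);
    }
    return node;
}

// Typical invocation: enter a parallel region and let one thread spawn
// the first tasks, e.g.
//   #pragma omp parallel
//   #pragma omp single
//   root = buildKdTreeParallel(pts, 0, (int)pts.size(), 0);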
Suppose we use the k-d tree for classification: for each query point, we traverse the tree to find its k-nearest neighbors. In real applications there are many query points to process, and parallelizing the traversal itself is challenging. The standard sequential traversal for a single query is sketched below.
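For reference, here is a sketch of the standard single-query traversal described in the tutorials [3-4], reusing the Node and Point types above: descend toward the query, then backtrack, pruning a subtree whenever the splitting hyperplane is farther away than the current k-th best candidate, which a max-heap keeps on top:

#include <queue>
#include <utility>

double sqDist(const Point& a, const Point& b) {
    double d = 0;
    for (int i = 0; i < DIM; ++i) d += (a[i] - b[i]) * (a[i] - b[i]);
    return d;
}

// Max-heap of (squared distance, point); the worst candidate sits on top.
using Heap = std::priority_queue<std::pair<double, Point>>;

void knnSearch(const Node* node, const Point& q, int k, int depth, Heap& heap) {
    if (!node) return;
    double d = sqDist(node->point, q);
    if ((int)heap.size() < k) {
        heap.push({d, node->point});
    } else if (d < heap.top().first) {
        heap.pop();
        heap.push({d, node->point});
    }
    int axis = depth % DIM;
    double diff = q[axis] - node->point[axis];
    const Node* nearSide = diff < 0 ? node->left  : node->right;
    const Node* farSide  = diff < 0 ? node->right : node->left;
    knnSearch(nearSide, q, k, depth + 1, heap);
    // Visit the far side only if the hyperplane could hide a closer point.
    if ((int)heap.size() < k || diff * diff < heap.top().first)
        knnSearch(farSide, q, k, depth + 1, heap);
}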
We found some papers for reference:
Parallel Algorithms For Nearest Neighbor Search Problems In High Dimensions [1]
PKNN-MIFS: A Parallel KNN Classifier over an Optimal Subset of Features [5]
PANDA: Extreme Scale Parallel K-Nearest Neighbor on Distributed Architectures [6]
Also, we found some tutorials [3-4] explaining the k-d tree and KNN in detail.
Implement two versions of the KNN algorithm using a k-d tree: one sequential and one parallel. Real-world data covering various situations will be collected to evaluate our implementations.
In the parallel version, we plan to use OpenMP to parallelize the computation of the workload; one natural starting point, parallelizing across queries, is sketched below.
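A minimal sketch of the query-level parallelism we have in mind, reusing knnSearch from above (the schedule clause and chunk size are placeholders to be tuned on the GHC machines): queries are independent, so a parallel for over the query set needs no synchronization beyond a per-query result heap:

#include <omp.h>

std::vector<Heap> knnAll(const Node* root,
                         const std::vector<Point>& queries, int k) {
    std::vector<Heap> results(queries.size());
    // Each iteration writes only results[i]; dynamic scheduling balances
    // queries whose traversals prune very differently.
    #pragma omp parallel for schedule(dynamic, 64)
    for (long long i = 0; i < (long long)queries.size(); ++i) {
        knnSearch(root, queries[i], k, 0, results[i]);
    }
    return results;
}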
Since we plan to run on the GHC machines using 8 cores, we expect a speedup approaching 8x over the sequential version.
We plan to present graphs and tables showing the speedup of the parallelized algorithm under different workloads. At the poster session, we also want to give some insights into which parts of the computation we optimized.
If we have more spare time, we also hope to:
Implement the parallel version of the algorithm using the MPI framework.
To make this work more intuitive, visualize some results, e.g., plot the division of the different clusters on the grid.
We will run and test our algorithms on the GHC machines, for the following reasons:
OpenMP and MPI are already installed on the GHC machines, which reduces the time spent configuring the development environment and lets us devote more time to researching the algorithm and frameworks.
The GHC machines are computationally more powerful than our PCs. Working on a common platform also lets us ignore the hardware and software differences between our personal machines, making collaboration easier.
The PSC machines' computational resources are highly limited, and users must reserve resources before running their code.
11/09 - 11/16: Read related materials about the KNN algorithm and k-d tree. Set up sequential KNN algorithm code skeleton. [Done]
11/17 - 11/24: Complete the sequential KNN algorithm and run some simple performance benchmarks. [Done]
11/25 - 11/28: Construct different workloads and generate concrete benchmarks. [Done]
11/29 - 12/02: Implement the parallelized KNN algorithm with OpenMP (a naive version). Compare the speedup with the sequential version. [Done, Both]
12/03 - 12/10: Optimize the parallelized KNN algorithm and compare different schemes under different workloads. [Done, Both]
12/11 - 12/14: Analyze the speedup of the parallelized version over the sequential version with graphs [Done, Kun Liu]. Compare the performance between different schemes with graphs [Done, Yucheng Hu].
12/15 - 12/17: Prepare final project report and poster [Done, Both].