Methodology
This project uses BiPart, a multi-level hypergraph partitioner developed by S. Maleki and U. Agarwal at the University of Texas at Austin. To use this tool for community detection in protein-protein interaction (PPI) networks, I had to add new features to it. In this section, I briefly explain the algorithm behind BiPart and then describe how this hypergraph partitioner can be used to detect communities in a PPI network.
What is BiPart?
BiPart is a deterministic, parallel hypergraph partitioner. Its algorithms are based on multi-level hypergraph partitioning [6]. As illustrated in the figure, these algorithms reduce the size of the graph (or hypergraph) by collapsing vertices (the coarsening phase), partition the smaller graph (the initial partitioning phase), and then uncoarsen it to construct a partition of the original graph (the uncoarsening and refinement phase).
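To make the multi-level scheme concrete, the following is a toy Python sketch of the three phases. Everything in it (the greedy pair matching, the round-robin initial partition, the brute-force refinement pass with a crude balance rule) is a simplification invented for illustration; it is not BiPart's actual algorithm or API.

from collections import Counter

def cut_size(hyperedges, part):
    """Number of hyperedges whose vertices span more than one part."""
    return sum(1 for e in hyperedges if len({part[v] for v in e}) > 1)

def coarsen(vertices, hyperedges):
    """Greedily merge pairs of vertices that co-occur in a hyperedge
    (a stand-in for BiPart's real matching policies)."""
    rep = {}                                    # fine vertex -> its representative
    for e in hyperedges:
        free = sorted(v for v in e if v not in rep)
        for a, b in zip(free[0::2], free[1::2]):
            rep[a], rep[b] = a, a               # merge b into a
    mapping = {v: rep.get(v, v) for v in vertices}
    coarse = [frozenset(mapping[v] for v in e) for e in hyperedges]
    return mapping, [e for e in coarse if len(e) > 1]

def refine(hyperedges, part, k):
    """One greedy pass of single-vertex moves that lower the cut, keeping
    parts roughly balanced (one vertex of slack over a perfect split)."""
    cap = len(part) // k + 1
    sizes = Counter(part.values())
    for v in part:
        old = part[v]
        best_p, best_c = old, cut_size(hyperedges, part)
        for p in range(k):
            if p != old and sizes[p] >= cap:
                continue                        # move would overfill part p
            part[v] = p
            c = cut_size(hyperedges, part)
            if c < best_c:
                best_p, best_c = p, c
        part[v] = best_p
        if best_p != old:
            sizes[old] -= 1
            sizes[best_p] += 1
    return part

def multilevel_partition(hyperedges, k, coarse_limit=4):
    vertices = {v for e in hyperedges for v in e}
    history = []
    # Coarsening phase: collapse vertices until the hypergraph is small.
    while len(vertices) > coarse_limit:
        mapping, coarse_edges = coarsen(vertices, hyperedges)
        if all(v == c for v, c in mapping.items()):
            break                               # no progress; stop early
        history.append((mapping, hyperedges))
        hyperedges, vertices = coarse_edges, set(mapping.values())
    # Initial partitioning phase: any cheap split of the small hypergraph.
    part = {v: i % k for i, v in enumerate(sorted(vertices))}
    part = refine(hyperedges, part, k)
    # Uncoarsening and refinement phase: project back up, refining per level.
    for mapping, finer_edges in reversed(history):
        part = {v: part[mapping[v]] for v in mapping}
        part = refine(finer_edges, part, k)
    return part

# Tiny demo: two weakly linked triangles should land in different parts.
edges = [{1, 2, 3}, {2, 3, 4}, {5, 6, 7}, {6, 7, 8}, {4, 5}]
print(multilevel_partition(edges, k=2))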
The objective function we minimize in this problem is the total cut, i.e., the number of edges (hyperedges) connecting different partitions. We add a further requirement to the objective: minimizing the sum of the weights of the cut edges. This is because our protein-protein interaction network carries a score on each edge, a confidence value indicating how significant the interaction is. By choosing this objective, the hope is to generate partitions in which the nodes of each partition are connected by high-scoring edges.
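As a small concrete illustration of this objective, the sketch below computes the weighted cut of a given partition. The data layout (a list of hyperedges with a parallel list of scores) is an assumption made for illustration, not BiPart's internal representation.

def weighted_cut(hyperedges, weights, part):
    """Sum of the scores of hyperedges whose vertices span more than
    one part; part[v] is the partition id assigned to vertex v."""
    return sum(w for e, w in zip(hyperedges, weights)
               if len({part[v] for v in e}) > 1)

# Example: the A-B interaction (score 0.9) is cut, B-C (score 0.4) is not.
part = {"A": 0, "B": 1, "C": 1}
print(weighted_cut([{"A", "B"}, {"B", "C"}], [0.9, 0.4], part))   # -> 0.9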
Choosing the number of partitions
In graph and hypergraph partitioning problems, the number of partitions must be chosen in advance. In this problem, however, the number of communities is not known. The approach I used to overcome this difficulty is to guess the number of communities and run the partitioner to create that many partitions (communities). I designed an evaluation system that assesses the quality of the detected communities; it stops if the result is good, and otherwise creates one extra partition in the next iteration.
In this project, I start with N/2 partitions, where N is the number of nodes in my network. Once the communities are created, I check their quality. I use two notions of quality. The weighted total cut drives the partitioner toward communities whose nodes are strongly connected to each other, but once the communities are created, the edge cut cannot be used to evaluate them. To overcome this problem, I designed a system that evaluates the generated communities directly.
Once K communities are generated (K starts at N/2), I randomly remove 10% of the edges in the network and re-run the partitioner to generate the same number of communities on the perturbed network. Then I compare the new communities with the original ones. If the partitioner's prediction is good enough, i.e., at least 98% of the generated communities contain a similar set of nodes to the originals, I stop and return the generated partitions as potential communities. Otherwise, I re-run the partitioner to generate K+1 partitions and re-evaluate them.
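The sketch below shows how this search loop could look in Python. Here run_bipart stands for a hypothetical wrapper that invokes the partitioner and returns K node sets, and the Jaccard-based matching rule is only one plausible reading of "similar nodes", since the measure of similarity is not fixed above.

import random

def jaccard(a, b):
    """Overlap between two node sets, from 0 (disjoint) to 1 (identical)."""
    return len(a & b) / len(a | b)

def stable_fraction(original, perturbed, threshold=0.5):
    """Fraction of original communities with a close match (by Jaccard
    similarity) among the communities of the perturbed network.
    The 0.5 threshold is an assumed value, not taken from the text."""
    hits = sum(1 for c in original
               if any(jaccard(c, d) >= threshold for d in perturbed))
    return hits / len(original)

def find_communities(nodes, edges, run_bipart, drop=0.10, accept=0.98):
    """Guess-and-check search for the number of communities K.
    run_bipart(nodes, edges, k) is a hypothetical partitioner wrapper
    that returns a list of k node sets."""
    k = len(nodes) // 2                    # start with N/2 partitions
    while True:
        communities = run_bipart(nodes, edges, k)
        # Perturb: keep a random 90% of the edges and re-partition.
        kept = random.sample(edges, int(len(edges) * (1 - drop)))
        perturbed = run_bipart(nodes, kept, k)
        # Accept if at least 98% of communities survive the perturbation.
        if stable_fraction(communities, perturbed) >= accept:
            return communities
        k += 1                             # otherwise try one more partition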