Comparing the algorithms, we see that the greedy algorithm, while very fast, yields poor results. The other three algorithms yield similar results, with Greedy-top neighbors being the most efficient of the three. However, Metropolis and iterative simulated annealing (ISA) sometimes outperform Greedy-top neighbors in terms of F1 score, with Metropolis generally giving better results than ISA.
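To make the difference between greedy and Metropolis-style sampling concrete, the following is a minimal sketch of a Metropolis acceptance rule used to grow a candidate complex from a seed node. It is not the pipeline's actual implementation: the function names, the toy graph, the use of edge density as the score (standing in for a learned classifier score), the temperature values, and the choice to stop growing on the first rejected move are all assumptions made for illustration. Setting `cooling` below 1.0 lowers the temperature at each step, which gives a simulated-annealing-like schedule.

```python
import math
import random


def metropolis_accept(current_score, candidate_score, temperature, rng=random):
    """Accept improving moves always; accept worse moves with
    probability exp((candidate - current) / temperature)."""
    if candidate_score >= current_score:
        return True
    return rng.random() < math.exp((candidate_score - current_score) / temperature)


def density(graph, members):
    """Toy score (assumption): fraction of possible edges present among members."""
    n = len(members)
    if n < 2:
        return 0.0
    # Each undirected edge is seen from both endpoints, hence the division by 2.
    edges = sum(1 for u in members for v in graph[u] if v in members) / 2
    return edges / (n * (n - 1) / 2)


def grow_complex(graph, seed, temperature=0.01, cooling=1.0, max_steps=20, rng=random):
    """Grow a candidate complex by adding one random neighbor per step.
    cooling=1.0 keeps the temperature fixed; cooling<1.0 lowers it each step."""
    members = {seed}
    score = density(graph, members)
    for _ in range(max_steps):
        frontier = sorted({v for u in members for v in graph[u]} - members)
        if not frontier:
            break
        candidate = members | {rng.choice(frontier)}
        candidate_score = density(graph, candidate)
        if not metropolis_accept(score, candidate_score, temperature, rng):
            break  # assumption: stop growing at the first rejected move
        members, score = candidate, candidate_score
        temperature *= cooling
    return members, score


# Hypothetical toy network: a triangle (a, b, c) with a pendant node d.
toy_graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
members, score = grow_complex(toy_graph, "a", rng=random.Random(0))
```

A purely greedy rule would replace `metropolis_accept` with the check `candidate_score >= current_score`; the Metropolis rule differs only in occasionally accepting a worse candidate, which lets the search escape local optima at the cost of extra steps.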
Comparing variants 1 and 2, we conclude that the reduced network yields higher recall, while the full network yields higher precision. This is perhaps because, in the reduced network, only proteins from known complexes are present, so the known complexes are recovered better, leading to higher recall.
While the F1 scores do not differ dramatically between the two variants, the algorithms run more than 10x faster on the reduced network.
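As a reminder of how the three metrics relate, here is a minimal sketch of computing precision, recall, and F1 score for a set of predicted complexes against known complexes. The matching rule (a Jaccard overlap of at least 0.5 counts as a match), the function names, and the example protein labels are assumptions for illustration and are not necessarily the evaluation measures used in this work.

```python
def jaccard(a, b):
    """Jaccard overlap between two protein sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def precision_recall_f1(predicted, known, threshold=0.5):
    """Precision: fraction of predicted complexes matching some known complex.
    Recall: fraction of known complexes matched by some predicted complex.
    The overlap threshold is an assumption of this sketch."""
    matched_pred = sum(
        1 for p in predicted if any(jaccard(p, k) >= threshold for k in known)
    )
    matched_known = sum(
        1 for k in known if any(jaccard(p, k) >= threshold for p in predicted)
    )
    precision = matched_pred / len(predicted) if predicted else 0.0
    recall = matched_known / len(known) if known else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical example: two known complexes, three predictions.
known = [{"A", "B", "C"}, {"D", "E"}]
predicted = [{"A", "B", "C"}, {"D", "E", "F", "G"}, {"X", "Y"}]
precision, recall, f1 = precision_recall_f1(predicted, known)
```

In this toy example both known complexes are recovered, but one prediction matches nothing, so recall exceeds precision. Predictions made up of proteins outside any known complex can only lower precision under this rule, which is one way a full network could score differently from a reduced one.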
This was done primarily to compare our methods with some of the state-of-the-art supervised algorithms. In the first variant, Metropolis outperforms RM (RegressionModel) and SCI-SVM (described in Related Works) in terms of recall. In the second variant, all three algorithms (Greedy-top neighbors, Metropolis and Iterative Simulated Annealing) outperform SCI-SVM and SCI-BN in terms of F1 score.
For the CombYeast PPIN, all the algorithms yield higher recall than precision, with Iterative Simulated Annealing giving the best recall.
Figure legend (complexes by size): small, green; medium, blue; big, purple.
Two complexes are predicted on the left, a smaller one in green and a bigger one in blue. This nicely depicts how the algorithm (Metropolis) is able to distinguish two complexes that are connected by a single node. The left complex most closely resembles the AFF4 super elongation complex (SEC), which could be a key regulator in the pathogenesis of leukemia (Lin et al., 2010), while the right one is unknown.
On the right, the dense complex most closely resembles the CCR4-NOT transcription complex, called the 'control freak of eukaryotic cells' (Miller & Reese, 2012).
Lin, C., Smith, E. R., Takahashi, H., Lai, K. C., Martin-Brown, S., Florens, L., ... & Shilatifard, A. (2010). AFF4, a component of the ELL/P-TEFb elongation complex and a shared subunit of MLL chimeras, can link transcription elongation to leukemia. Molecular cell, 37(3), 429-437.
Miller, J. E., & Reese, J. C. (2012). Ccr4-Not complex: the control freak of eukaryotic cells. Critical reviews in biochemistry and molecular biology, 47(4), 315-333.
Figure legend: small, yellow; big, brown.
Figure: an unknown complex.
Figure legend: small, pink; big, purple.
In this work, we demonstrate the potential of supervised ML strategies to find protein complexes in protein interaction networks. We propose a streamlined computational pipeline, Super.Complex, which goes through the steps of data preparation, an auto-ML pipeline, and a subgraph sampling process with a choice between different algorithms. Through our experiments, we find that, in general, the GradientBoosting Classifier and the Metropolis sampling algorithm tend to perform well. Further, using a reduced PPIN containing only proteins from known complexes is much faster than using the full network, while yielding similar accuracy. While the pipeline yields results comparable to, and in some cases slightly better than, some state-of-the-art algorithms, this plug-and-play pipeline can be improved by further optimizing each individual part, for instance by using biological features in addition to topological features. Experiments were performed on real human and yeast PPIN datasets, and the results can be examined further to glean more biological insights.