The known biological complexes are first cleaned by merging 'similar' complexes: pairs of complexes with a Jaccard score greater than 0.6 are merged iteratively until no pair of complexes is 'similar'.
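The merging step above can be sketched as follows. This is a minimal illustration, not the original implementation. It assumes that the Jaccard score is computed on the node (protein) sets of two complexes, that merging means taking their union, and that the first qualifying pair found is merged first. The names `jaccard` and `merge_similar` are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard score of two node sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def merge_similar(complexes, threshold=0.6):
    """Merge pairs with Jaccard score > threshold until none remain.

    Assumption: a merged complex is the union of the two node sets.
    """
    merged = [set(c) for c in complexes]
    changed = True
    while changed:
        changed = False
        for i, j in combinations(range(len(merged)), 2):
            if jaccard(merged[i], merged[j]) > threshold:
                merged[i] |= merged[j]  # absorb complex j into complex i
                del merged[j]
                changed = True
                break  # indices are stale after deletion, so rescan
    return merged


complexes = [{"A", "B", "C", "D"}, {"A", "B", "C", "D", "E"}, {"X", "Y", "Z"}]
result = merge_similar(complexes)
```

Here the first two complexes have a Jaccard score of 4/5 = 0.8 and are merged, while the third is left unchanged. Rescanning from the start after every merge is simple but quadratic per pass. It also guarantees the stopping condition in the text, since the loop only ends when no pair exceeds the threshold.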
Next, the complexes are split into non-overlapping train and test sets with equal size distributions. First, a random train-test split is made, for example 90-10 (parameter initSplit). The subsequent overlap-removal step mainly transfers bigger complexes from train to test, so to keep the final size distributions equal, a fraction, for example 30% (parameter trainTransfer), of the smaller train complexes (those with size less than the mean size of the train complexes) is transferred to the test complexes at this point. Next, any train complex sharing one or more edges with any test complex is transferred from train to test. This is repeated until no train complex shares an edge with any test complex. In each iteration, test complexes with size greater than 30 are dropped to prevent high overlap when checking against the train complexes. Finally, only train and test complexes that have at least 3 nodes and edges in the protein interaction network of interest are kept, since only topological features are used. The parameters initSplit and trainTransfer are optimized so that the train and test size distributions are as close to equal as possible.
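The splitting procedure above can be sketched as follows. This is an illustration under several assumptions, not the original code. Each complex is represented as a set of edges, each edge being a `frozenset` of two node names. Size is taken to be the number of nodes. "Dropped" is read as removal from the test set. The final filter on the protein interaction network and the optimization of initSplit and trainTransfer are omitted. The names `nodes_of`, `split_train_test` and `max_test_size` are hypothetical.

```python
import random


def nodes_of(edges):
    """Nodes of a complex given as a set of frozenset({u, v}) edges."""
    return set().union(*edges)


def split_train_test(complexes, init_split=0.9, train_transfer=0.3,
                     max_test_size=30, seed=0):
    rng = random.Random(seed)
    pool = list(complexes)
    rng.shuffle(pool)
    cut = round(init_split * len(pool))
    train, test = pool[:cut], pool[cut:]

    # Move a fraction of the smaller-than-mean train complexes to test.
    if train:
        mean_size = sum(len(nodes_of(c)) for c in train) / len(train)
        small = [i for i, c in enumerate(train)
                 if len(nodes_of(c)) < mean_size]
        moved = set(rng.sample(small, int(train_transfer * len(small))))
        test += [train[i] for i in moved]
        train = [c for i, c in enumerate(train) if i not in moved]

    # Repeat until no train complex shares an edge with a test complex.
    while True:
        # Assumption: oversized test complexes are removed entirely.
        test = [c for c in test if len(nodes_of(c)) <= max_test_size]
        test_edges = set().union(*test)
        overlapping = [c for c in train if c & test_edges]
        if not overlapping:
            return train, test
        train = [c for c in train if not (c & test_edges)]
        test += overlapping
```

The loop terminates because every iteration that does not return moves at least one complex out of the finite train set.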
Final train-test size distributions for human complexes
Negative complexes are represented by random walks sampled from the network, each starting from a random seed node and moving to a random neighbor at each step. The number of steps ranges from the minimum to the maximum size of the positive complexes. An equal number of random walks is sampled for each number of steps, yielding an almost uniform size distribution for the negative complexes. Thus, the size distribution of the positive complexes is taken into consideration while training the machine learning model. In the feature extraction step, random walks resembling complexes are removed. To keep the final number of negative complexes close to the number of positive complexes, a slightly higher number of random walks is sampled initially (via a scale factor).
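The random-walk sampling described above can be sketched as follows. This is an illustration, not the original implementation. It assumes the network is given as an adjacency dictionary mapping each node to a list of neighbors, and that a negative complex is the set of nodes visited by a walk. The removal of walks resembling complexes, which happens during feature extraction, is not shown. The function name, its parameters and the default scale factor of 1.1 are hypothetical.

```python
import random


def sample_negative_walks(adj, min_size, max_size, n_positive,
                          scale=1.1, seed=0):
    """Sample the same number of random walks for every step count."""
    rng = random.Random(seed)
    step_counts = range(min_size, max_size + 1)
    # Oversample by `scale` so that enough walks survive later filtering.
    per_count = max(1, round(scale * n_positive / len(step_counts)))
    seed_nodes = list(adj)
    walks = []
    for n_steps in step_counts:
        for _ in range(per_count):
            current = rng.choice(seed_nodes)  # random seed node
            visited = {current}
            for _step in range(n_steps):
                neighbors = adj[current]
                if not neighbors:  # isolated node: the walk ends early
                    break
                current = rng.choice(neighbors)
                visited.add(current)
            walks.append(visited)
    return walks


# Toy network: a ring of 10 nodes.
adj = {i: [(i - 1) % 10, (i + 1) % 10] for i in range(10)}
walks = sample_negative_walks(adj, min_size=3, max_size=5,
                              n_positive=9, scale=1.0)
```

Because a walk may revisit nodes, the number of distinct nodes can be smaller than the number of steps, which is consistent with the size distribution being only almost uniform.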