We select the best classifier for each experiment based on the training cross-validation score, the test average precision score, and the test Precision-Recall curve. All three metrics are treated as equally important when choosing the classifier used in the sampling process.
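As a reminder of what the two test metrics measure, here is a minimal sketch that computes the points of a Precision-Recall curve and the average precision from labels and classifier scores. It uses the step-wise sum AP = sum over thresholds of (R_n - R_(n-1)) * P_n, which is also the definition behind scikit-learn's `average_precision_score`; the source does not say which implementation was actually used. The function names and the toy labels and scores are hypothetical and are not taken from the experiments.

```python
def precision_recall_points(y_true, scores):
    """Return (precision, recall) pairs, one per distinct score threshold,
    ordered from the highest threshold to the lowest."""
    total_pos = sum(y_true)
    if total_pos == 0:
        raise ValueError("need at least one positive label")
    # Rank examples from the most to the least confident prediction.
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    points = []
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last example sharing this score,
        # so tied scores count as a single threshold.
        last_of_tie = i + 1 == len(ranked) or ranked[i + 1][0] != score
        if last_of_tie:
            points.append((tp / (tp + fp), tp / total_pos))
    return points


def average_precision(y_true, scores):
    """Sum of precision at each threshold, weighted by the gain in recall."""
    ap = 0.0
    prev_recall = 0.0
    for precision, recall in precision_recall_points(y_true, scores):
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


# Toy data (hypothetical): 1 marks a true complex, 0 a non-complex.
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.1]
toy_ap = average_precision(toy_labels, toy_scores)
```

A classifier that ranks every positive above every negative gets an average precision of 1.0, so the score summarizes the whole curve in one number, while the curve itself shows where precision drops as recall increases.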
Variant 1 - MIPS train, TAP-MS test
Variant 2 - TAP-MS train, MIPS test
Interestingly, the GradientBoostingClassifier performs best only when the train and test datasets have similar biases, as in Experiments 1 and 3, where the complexes are split into train and test sets with equal size distributions.
Note that in Experiment 2, the classifier is trained on one dataset and tested on a dataset from a different source, so the two datasets can have different inherent biases.
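One way to obtain train and test sets with matching size distributions, as described for Experiments 1 and 3, is to group the complexes by size and split each group separately. The sketch below is an assumption about how such a split could be done, not the procedure actually used; the function name `split_by_size`, its parameters, and the toy complexes are hypothetical.

```python
import random
from collections import defaultdict


def split_by_size(complexes, test_fraction=0.5, seed=0):
    """Split complexes into train and test so that each complex size is
    represented in both sets in roughly the same proportion."""
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    by_size = defaultdict(list)
    for members in complexes:
        by_size[len(members)].append(members)

    train, test = [], []
    for size in sorted(by_size):
        group = by_size[size]
        rng.shuffle(group)
        # Rounds down, so any leftover complex of this size goes to train.
        n_test = int(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


# Toy complexes (hypothetical protein identifiers).
toy_complexes = [
    ["P1", "P2"], ["P3", "P4"], ["P5", "P6"], ["P7", "P8"],
    ["P9", "P10", "P11"], ["P12", "P13", "P14"],
]
toy_train, toy_test = split_by_size(toy_complexes)
```

With a cross-source setup such as Experiment 2, no such matching is enforced, which is consistent with the train and test sets there having different biases.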