6 protein banchmark — Six benchmark datasets - A python implementation of ProtNN and the protein to graph transformation methods


W. Dhifli, A. B. Diallo. ProtNN: Fast and Accurate Protein 3D-Structure Classification in Structural and Topological Space. BMC BioData Mining 9 (1): 1-17, 2016.

ProtNN is a protein 3D-structure classification approach. ProtNN assigns to a query protein the class with the highest number of votes across the set of k nearest neighbor reference proteins based on a graph representation model and a pairwise similarity between vector embedding of both the query and the reference protein-graphs in structural and topological spaces.


We evaluate the classification performance of ProtNN on six benchmark datasets of protein 3D-structures from the literature. Each dataset is composed of positive protein examples that are from a selected protein family, and negative protein examples that are randomly sampled from the PDB. Table summarizes the characteristics of the six datasets.

SCOP ID: identifier of protein family in SCOP, Pos.: number of positive examples, Neg.: number of negative examples.

Experimental Protocol and Settings

Experiments were conducted on a CentOS Linux workstation with an Intel core-i7 CPU at 3.40 GHz, and 16.00 GB of RAM. All the experiments are performed in a single process mode with no parallelization. To transform protein into graph, we used a δ value of 7A˚ . The evaluation measure is the classification accuracy, and the evaluation technique is Leave-One-Out (LOO) where each dataset is used to create N classification scenarios, where N is the number of proteins in the dataset. In each scenario, a reference protein is used as a query instance and the rest of the dataset is used as reference. The aim is to correctly predict the class of the query protein. The classification accuracy for each dataset is averaged over results of all the N evaluations.

ProtNN Classification Results

Comparison with Other Classification Techniques

We compare our approach with multiple state-of-the-art approaches for protein structure classification namely: sequence alignment-based classification (using Blast), structural alignment-based classification (using Combinatorial Extension (CE), Sheba, and FatCat), and substructure(subgraph)-based classification (using GAIA, LPGBCMP, and D&D). For sequence and structural alignment-based classification, we align each protein against all the rest of the dataset. We assign to the query protein the class of the reference protein with the best hit score. For the substructure-based approaches, all the selected approaches are mainly for mining discriminative subgraphs. The discovered substructures are considered as features for describing each example of the original data. The constructed description matrix is used for training in the classification. For our approach, we show the classification accuracy results of ProtNN with Recursive Feature Elimination (RFE) using the std-Euclidean distance. We also show the best results of ProtNN (denoted ProtNN*) with RFE using each of the top-five distance measures. We use k = 1 both for ProtNN and ProtNN*.

The alignment-based approaches FatCat and Sheba outperformed CE, Blast, and all the subgraph-based approaches. For the subgraph-based approaches, D&D scored better than LPGBCMP and GAIA on all cases except with DS1 where GAIA scored best. On average, ProtNN* ranked first with the smallest distance between its results and the best obtained accuracies with each dataset.

Accuracy comparison of ProtNN with other classification techniques.

Scalability and Runtime Analysis

Besides being accurate, an efficient approach for functional classification of protein 3D-structures has to be very fast in order to provide practical usage that meets the increasing load of data in real-world applications. We study the runtime of ProtNN and FatCat, the most competitive approach in our previous comparative experiments. We analyze the variation of runtime for both approaches with higher numbers of proteins ranging from 10 to 100 3D-structures with a step-size of 10. A huge gap is clearly observed between the runtime of ProtNN and that of FatCat. The gap gets larger with higher numbers of proteins. Indeed, FatCat took over 5570 seconds with the 100 proteins while ProtNN (all) did not exceed 118 seconds for the same set which means that our approach is 47x faster than FatCat on that experiment.

Scalability to a PDB-wide classification

We evaluate the scalability of PROTNN on the classification of the entire Protein Data Bank (we use 94126 structure representing all the available protein 3D-structures in the PDB by the end of July 2014). We also show the runtime for FatCat and CE (the structural alignment approaches used in the PDB website). We recall that the experiments are performed in single process mode with no parallelization for all the approaches. It is clear that the computation of attributes is the most expensive part of our approach as some attributes are very complex. However, building the graph models and the computation of attributes represent the preprocessing step and are only performed once for the reference database. All and all, PROTNN runtime was less than a week with an average runtime of 5.9 seconds for the preprocessing and classification of each protein 3D-structure against the entire PDB. On the other hand, both FatCat and CE did not finish running within two weeks. Thus, for comparison, we computed the average runtime for each approach on the classification of a sample of 100 protein 3D-structures against all the PDB. On average, FatCat and CE took respectively more than 52 and 58 hours per protein making our approach faster with thousands orders of magnitude on the classification of a 3D-structure against the entire PDB.

Runtime results of ProtNN, FatCat and CE on the entire Protein Data Bank.