To improve upon the data from project A, this project uses the COMPARTMENTS confidence scores and SignalP prediction scores to build a feature matrix that is fed into a classifier for subcellular localisation. The co-fractionation mass spectrometry (CFMS) and amino acid frequency (AA freq.) datasets from previous homework in this course are added as well. Since the CFMS and AA freq. datasets each already have a large number of features (2057 fraction abundances and 20 AA frequencies, respectively), principal component analysis (PCA) is used to reduce their dimensionality before they enter the classifier.
Figure 1 | Workflow for assembling the feature matrix used to build a classifier of protein localisation. For this project, extracellular proteins were chosen as an example.
The co-fractionation mass spectrometry (CFMS) and amino acid (AA) frequency data were acquired from the BCH 394P course website. These datasets contain 1832 human proteins, only a subset of the proteins in the original experiment (Wan et al. 2015), whose data can be found under the ID PXD002322 in the ProteomeXchange database.
The subcellular localisation of these 1832 proteins was retrieved from UniProtKB. Additionally, the localisation confidence score from the COMPARTMENTS database (Binder et al. 2014) was fetched, and the entries were searched for predicted signal peptides with the online predictor SignalP 6.0 (Teufel et al. 2022).
To reduce the dimensionality of the CFMS and AA frequency data, PCA was employed. PCA was executed in OriginLab, which was also used to create the corresponding scree plots and the PCA scatter + loading biplots.
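The PCA itself was run in OriginLab, so the following is only a minimal sketch of the underlying calculation, not the procedure actually used. It assumes a covariance-based PCA (the report does not state whether OriginLab analysed the covariance or the correlation matrix) and finds the leading components by power iteration with deflation. All function names and the toy data are hypothetical.

```python
import math
import random


def center_columns(rows):
    """Subtract each column's mean so every feature has mean zero."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def covariance_matrix(centered):
    """Sample covariance between all pairs of (already centred) features."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def leading_components(cov, k, iters=1000, seed=0):
    """Approximate the k largest eigenpairs by power iteration with deflation."""
    d = len(cov)
    work = [row[:] for row in cov]
    rng = random.Random(seed)
    pairs = []
    for _ in range(k):
        # Random unit start vector (seeded, so the result is reproducible).
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iters):
            w = [sum(work[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # no variance left to extract
                break
            v = [x / norm for x in w]
        # Rayleigh quotient gives the eigenvalue belonging to v.
        eigenvalue = sum(v[i] * work[i][j] * v[j]
                         for i in range(d) for j in range(d))
        pairs.append((eigenvalue, v))
        # Deflation: remove the component just found from the matrix.
        for i in range(d):
            for j in range(d):
                work[i][j] -= eigenvalue * v[i] * v[j]
    return pairs


def project(centered, pairs):
    """Scores: coordinates of every row along each retained component."""
    return [[sum(x * c for x, c in zip(row, vec)) for _, vec in pairs]
            for row in centered]


# Toy data (hypothetical): two perfectly correlated features.
toy = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
toy_centered = center_columns(toy)
toy_cov = covariance_matrix(toy_centered)
toy_pairs = leading_components(toy_cov, k=1)
toy_scores = project(toy_centered, toy_pairs)
toy_explained = toy_pairs[0][0] / sum(toy_cov[i][i] for i in range(2))
```

In the toy data the first component captures essentially all of the variance, because the second feature is an exact multiple of the first. For the 2057-feature CFMS matrix this pure-Python approach would be slow; it is meant to show the steps (centre, covariance, eigenvectors, projection), not to replace a numerical library.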
A feature matrix containing up to four different sources of information (AA frequency, CFMS times, SignalP prediction, COMPARTMENTS database confidence score) plus a binary localisation flag fetched from UniProtKB was built:
AA freq. (7 PCs) | CFMS times (20 PCs) | SignalP prediction | COMPARTMENTS score | Localisation flag
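The column layout above can be assembled as in the following sketch. It assumes each source is available as a dictionary keyed by protein ID, which is a hypothetical layout chosen for illustration; the column names, the yes/no class label and the CSV export are likewise choices of this sketch rather than the format actually used.

```python
import csv
import io


def build_feature_matrix(ids, aa_pcs, cfms_pcs, signalp, compartments,
                         extracellular):
    """Join the per-protein feature sources column-wise.

    All arguments except `ids` are dicts keyed by protein ID
    (hypothetical layout).
    """
    n_aa = len(aa_pcs[ids[0]])
    n_cfms = len(cfms_pcs[ids[0]])
    header = (["protein_id"]
              + [f"AA_PC{i + 1}" for i in range(n_aa)]
              + [f"CFMS_PC{i + 1}" for i in range(n_cfms)]
              + ["SignalP", "COMPARTMENTS", "extracellular"])
    rows = []
    for pid in ids:
        # Text label for the class column instead of a bare 0/1 number.
        flag = "yes" if extracellular[pid] else "no"
        rows.append([pid, *aa_pcs[pid], *cfms_pcs[pid],
                     signalp[pid], compartments[pid], flag])
    return header, rows


def to_csv(header, rows):
    """Serialise the matrix as CSV text, one protein per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Toy example with made-up IDs and values (2 AA PCs, 1 CFMS PC).
demo_header, demo_rows = build_feature_matrix(
    ids=["P1", "P2"],
    aa_pcs={"P1": [0.1, -0.2], "P2": [0.3, 0.4]},
    cfms_pcs={"P1": [1.5], "P2": [-0.5]},
    signalp={"P1": 0.9, "P2": 0.1},
    compartments={"P1": 4.0, "P2": 1.0},
    extracellular={"P1": True, "P2": False},
)
demo_csv = to_csv(demo_header, demo_rows)
```

Leaving out a feature group (for example, omitting the AA columns) gives the reduced matrices used to compare different combinations of information.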
To build a classifier, this feature matrix was fed into Weka, a tool that contains a collection of machine learning algorithms. Bagging-based Random Forest classification was chosen with the default settings (100 iterations with base learning) and evaluated by 10-fold cross-validation.
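To illustrate what 10-fold cross-validation does with this dataset, the sketch below splits 555 positive and 1277 negative entries into ten folds while keeping the class ratio roughly constant per fold. It assumes a stratified split; Weka's internal fold assignment may differ in detail, and the function names here are hypothetical.

```python
import random


def stratified_folds(labels, k=10, seed=0):
    """Assign every index to one of k folds, class by class, round-robin."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    folds = [[] for _ in range(k)]
    offset = 0
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        for pos, idx in enumerate(members):
            folds[(pos + offset) % k].append(idx)
        # Continue where the previous class stopped to balance fold sizes.
        offset += len(members)
    return folds


def train_test_splits(folds):
    """Yield (train, test) index lists; each fold is the test set once."""
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Class sizes taken from the extracellular example: 555 positive, 1277 negative.
cv_labels = [1] * 555 + [0] * 1277
cv_folds = stratified_folds(cv_labels, k=10)
```

Each protein is used for testing exactly once and for training nine times, so the reported statistics are based on predictions for proteins the model had not seen during training.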
Weka was also used to plot precision-recall curves (PRC, or R-P curves) for the positive entries (i.e. the proteins annotated as being in the compartment of interest). PRC was chosen over e.g. ROC curves (which plot the true positive rate against the false positive rate) because it is better suited to data like these, in which the ratio of positive to negative entries is far from 1 (for extracellular proteins: 555 positive vs. 1277 negative entries, i.e. 30% positive).
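The effect of class imbalance can be made concrete with a small calculation. The sketch below takes the class sizes from the text (555 positive, 1277 negative) and a purely hypothetical operating point (80% true positive rate, 20% false positive rate) to show that the same ROC point corresponds to a lower precision when negatives outnumber positives.

```python
def precision_from_rates(tpr, fpr, n_pos, n_neg):
    """Precision implied by a ROC operating point and the class sizes."""
    true_pos = tpr * n_pos
    false_pos = fpr * n_neg
    return true_pos / (true_pos + false_pos)


def positive_fraction(n_pos, n_neg):
    """Share of positive entries; the precision expected from random guessing."""
    return n_pos / (n_pos + n_neg)


N_POS, N_NEG = 555, 1277  # extracellular vs. other proteins (from the text)

# Hypothetical operating point, chosen only for illustration.
imbalanced = precision_from_rates(0.8, 0.2, N_POS, N_NEG)
balanced = precision_from_rates(0.8, 0.2, 1000, 1000)
baseline = positive_fraction(N_POS, N_NEG)

print(f"precision with 555:1277 entries: {imbalanced:.3f}")
print(f"precision with balanced classes: {balanced:.3f}")
print(f"random-guess precision baseline: {baseline:.3f}")
```

The ROC point is identical in both cases, yet the precision differs, which is why the precision-recall view is more informative here. The baseline value is also a useful reference when reading PRC areas: a classifier whose scores carry no information is expected to reach a precision close to the positive fraction.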
To reduce the dimensionality of the AA frequency and CFMS datasets, PCA was utilised. Scree plots (which rank the principal components by their eigenvalues) were graphed to visually aid the selection of a sensible number of PCs for later classification (see Figure 2).
In both scree plots, there is a noticeable change in slope (an "elbow") between adjacent line segments at PC number 7, which usually indicates a good number of PCs to pick. For the AA frequency PCA, this cut-off was used: dimensionality was reduced from 20+1 (20 residues plus localisation flag) to 7+1, which captures 57.10% of the data variance. For the CFMS data, 7 PCs cover only 39.8% of the data variance, so the top 20 PCs were picked instead (62.11% of the data variance). Thus, CFMS dimensionality was reduced from 2057+1 (CFMS times plus flag) to 20+1.
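The "percentage of data variance captured" follows directly from the eigenvalues shown in a scree plot. As a minimal sketch, the functions below compute the cumulative variance fraction and the smallest number of PCs that reaches a chosen target. The eigenvalues in the example are invented for illustration and are not those of the AA frequency or CFMS data.

```python
def cumulative_variance(eigenvalues):
    """Fraction of total variance captured by the top 1, 2, ... components."""
    total = sum(eigenvalues)
    running = 0.0
    fractions = []
    for value in sorted(eigenvalues, reverse=True):
        running += value
        fractions.append(running / total)
    return fractions


def components_needed(eigenvalues, target):
    """Smallest number of components whose cumulative fraction >= target."""
    for k, fraction in enumerate(cumulative_variance(eigenvalues), start=1):
        if fraction >= target:
            return k
    return len(eigenvalues)


# Hypothetical eigenvalues, for illustration only.
example_eigenvalues = [4.0, 2.0, 1.0, 1.0, 1.0, 1.0]
example_fractions = cumulative_variance(example_eigenvalues)
example_k = components_needed(example_eigenvalues, target=0.75)
```

A variance target like this complements the visual elbow criterion: when the elbow leaves too little variance covered, as for the CFMS data, the cumulative fraction gives a second, explicit basis for choosing a larger number of PCs.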
PCA scatter + loading biplots (Figure 3) show how the data separate along the first two principal components and how strongly each feature influences these components. Both scatter plots show only very weak separation. The AA freq. loading plot seems to loosely group the residues into charged, polar and hydrophobic clusters, although there are some outliers (e.g. R not being with the charged AAs, P being with the polar AAs). For the CFMS PCA, no loading plot could be generated because Scrappie, my laptop, could not handle it.
Figure 2 | PCA scree plots of the AA frequencies (left) and co-fractionation MS times (right) of extracellular vs. other proteins. For the AA frequency PCA, the top 7 PCs were chosen for the classifier (capturing 57.10% of the data variance); for the CFMS data, the top 20 PCs were picked (62.11% of the data variance). Inset shows full data range.
Figure 3 | PCA biplot of AA composition (left) and CFMS (right) of extracellular proteins vs. other proteins. Bottom and left axes: PCA scatter plot of proteins based on the two strongest PCs; extracellular proteins are shown in blue, others in grey. Top and right axes: loading plot. The vector of each residue indicates its influence on the PCs. The right graph lacks the loading plot because Scrappie (my dear beloved laptop) crashes when trying to load it.
A random-forest-based classifier of extracellular localisation was built using a feature matrix containing up to four different sources of information (AA frequency, CFMS times, SignalP prediction, COMPARTMENTS database confidence score) plus a binary localisation flag fetched from UniProtKB:
AA freq. (7 PCs) | CFMS times (20 PCs) | SignalP prediction | COMPARTMENTS score | Localisation flag
Multiple R-P curves were plotted for the different combinations of information (Figure 4). Out of all classifiers, the one relying solely on SignalP signal peptide prediction scores performed the worst (PRC area 0.32). This is unsurprising given that many of the 1832 proteins in the dataset enter the Sec pathway but are not necessarily secreted (they may instead end up e.g. in the lysosome). In comparison, the AA-frequency-based classifier performed much better (0.51). The classifier based only on the COMPARTMENTS database confidence score is only slightly better still (0.57), which is surprising, since the COMPARTMENTS database is partially based on data from UniProtKB, where the binary flag was pulled from.
The combination of all four feature groups gives rise to the best classifier (0.782). Removing the AA frequency feature group results in a classifier that is practically as good (0.777).
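For readers unfamiliar with the PRC area, the sketch below shows one common way to summarise a precision-recall curve in a single number: average precision, i.e. the mean of the precision values at the rank of every true positive. This is an assumption made for illustration. Weka may compute its PRC area with a different interpolation, so the values will not necessarily match those reported above. The labels and scores are made up.

```python
def average_precision(labels, scores):
    """Mean precision at the rank of each positive (ties are not treated specially)."""
    # Rank entries from the highest to the lowest classifier score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_pos = sum(labels)
    hits = 0
    total = 0.0
    for rank, idx in enumerate(order, start=1):
        if labels[idx] == 1:
            hits += 1
            total += hits / rank  # precision among the top `rank` entries
    return total / n_pos


# Hypothetical toy predictions: 1 = extracellular, 0 = other.
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.1]
toy_ap = average_precision(toy_labels, toy_scores)

# A ranking that places all positives first reaches the maximum of 1.0.
perfect_ap = average_precision([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])
```

In the toy example one negative is ranked above the second positive, which lowers the precision at that point and pulls the summary value below 1.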
The detailed quality statistics of all classifiers are listed in the table below.
Figure 4 | Precision-recall curves for RandomForest-based classifiers of extracellular vs. other proteins. Based on AA frequencies (using 7 PCs), co-fractionation MS times (20 PCs), SignalP prediction of signal peptide, localisation confidence score of the COMPARTMENTS database and the binary localisation flag from UniProtKB.
Table 1 | Quality statistics for extracellular localisation classifiers built with up to 4 sources of information.