Results and Discussion

Random Forest


For Random Forest (RF) classification, the data was split into training and testing groups using a 70:30 split. The training data was then subject to RF classification seeded with 10 trees using all 27 features. The model was assessed in terms of accuracy, specificity, sensitivity and precision. The prediction results for (disulfide=1, no-disulfide=0) are summarized as follows:

Accuracy - 0.935

Sensitivity - 0.954

Specificity - 0.916

Precision - 0.918

We hypothesized that a smaller and more significant set of features may improve model performance. To identify features for elimination, we calculated the importance of each feature to the forest of trees and ranked them in decreasing importance. We arbitrarily picked the top 5 features and re-trained the model with the same training and testing data. The top 5 important features are Cb:Cb, O:C, C:Cb, N:Cb, Ca:Cb, respectively. The results of the reduced model led to slightly worse model performance and are summarized below:

Accuracy - 0.905

Sensitivity - 0.899

Specificity - 0.911

Precision - 0.910

Likewise, we wondered whether the “standard practice” of looking solely at Cb:Cb distances would give equivalent performance with a single decision tree and re-trained with these parameters. The results are summarized below and suggest that addition features beyond Cb:Cb distances influence the ability to accurately predict disulfide bond formation.

Accuracy - 0.768

Sensitivity - 0.753

Specificity - 0.784

Precision - 0.781

To determine whether Cb:Cb distance is absolutely necessary for accurate prediction, we performed classification, but with only the Cb:Cb feature left out. The results are described below and likewise suggest that additional features are important for predicting whether a disulfide bond will exist.

Accuracy - 0.910

Sensitivity - 0.935

Specificity - 0.887

Precision - 0.890


Figure 3. Sorted visualization of the importance scores of each feature.

Figure 4. Enhanced view of figure 4 denoting the five relevant features for the predictive model.

Validation


To test our model against a biologically relevant scenario, we selected the pre-fusion structure of Human Cytomegalovirus (HCMV) glycoprotein B. Cytomegalovirus is a common beta-herpes virus that infects people of all ages. By age 40, over half of adults, globally, will have been infected though most people show no signs or symptoms. However, HCMV is the leading cause of congenital birth defects in the western world. Congenital infection with CMV causes auditory, cognitive and neurologic impairment in infants and affects around 1% of all newborns. Given the severity and importance of the disease, the development of a vaccine for the prevention of congenital CMV infection is a top priority for the US Institute of Medicine. While the first clinical trials for a CMV vaccine completed well over 30 years ago, an effective vaccine remains elusive. Recently, vaccine antigen design efforts have focused on HCMV glycoprotein B, a class III viral fusion protein that mediates virus-host membrane fusion. As the causative agent for viral entry in human cells, HCMV GB is an attractive vaccine candidate.

Membrane fusion is an essential process for viral entry into cells. Viral fusion with the host cell membrane is mediated by viral surface glycoproteins. Upon triggering, the fusion proteins decorating the viral surface undergo a substantial conformational change whereby they essentially invert and insert “harpoons” of hydrophobic fusion loops into the host cell membrane. Next, the fusion proteins undergo subsequent conformational changes that bring the two membranes within close proximity until they ultimately fuse, thereby creating a pore and allowing for viral entry.

The HCMV GB protein exists as a metastable structure on the surface of the viral envelope. When recombinantly expressed, HCMV GB is found to adopt its post-fusion structure. Viral fusion proteins in their prefusion states have been shown to be more effective at eliciting broadly neutralizing antibodies and result in greater recombinant expression. Therefore, there is significant interest in designing a prefusion stabilized HCMV GB antigen.

Common strategies to improve protein stability are the addition of cavity filling mutations, salt bridges, hydrogen bonds, proline residues and the addition of disulfide bonds. The latter is unique in that the cysteine thiol group (-SH) is capable of forming a reversible covalent S-S bond. The cross-linking between two cysteines can serve to stabilize monomeric or multi-subunit proteins making it an attractive candidate for protein engineering. When applied to structural-based vaccine design, disulfide bonds can be used to “lock” two flexible regions together, thus preventing a conformational change from prefusion to post-fusion.

Therefore, we sought to determine whether we could predict the probability of obtaining a successful disulfide bond if two residues were mutated to cysteine using HCMV GB as a model protein. Previously, a prefusion stabilized structure was solved by Pfizer (PDBID 7kdp). Pfizer subsequently patented several disulfide bond stabilizing mutations despite their absence in the PDB structure. To determine whether our model could predict a reasonable probability for a disulfide bond, we tested our model against one of these patented mutations, namely Q98C + G271C.

Following feature extraction, the model was used to predict the probability of these two residues “existing” as disulfide bond, or in other words, the probability of forming a disulfide bond when mutated to cysteine, assuming the rest of the protein structure is unperturbed. This resulted in a convincing probability of 0.8. Without considering the Cb:Cb feature, this probability drops to 0.4.


Q98C + G271C (Pfizer)

A potential disulfide forming pair patented by Pfizer for HCMV GB is shown on the right. Because GLY271 has no Cb atom, this residue has been manually changed to alanine and visualized in ChimeraX. The Cb:Cb distance also shown to be 4.540 angstroms. This pair was predicted to have a 0.8 probability of being a disulfide bond, suggesting that mutation to cysteine at each position would result in a successful disulfide bond.

Limitations


Potential pitfalls of this model are numerous. For example, we did not determine whether the structures used in training our model have a high degree of structural nor sequence similarity which could heavily bias the model performance. Ideally, structures with a relatively low degree of structural similarity would be selected such that the broadest range of possible disulfide geometries would be captured. We also did not account for the quality metrics associated with the X-ray structures therefore it is possible that poor quality structural data exists in our training set. A more refined approach to the data mining task may increase the quality of training data.

The limitations of our “window” approach surprisingly limit our dataset. Several issues exist which we were either unable to resolve or were hesitant to implement to avoid introducing bias in our feature selection. For example, if the cysteine occurs proximal to a C or N terminus such that we cannot compute a full 7 residue window, there would be a mismatch in the number of features selected and thus we skip this disulfide pair. We decided to remove the chemical environment features because the positive and negative sample set is derived from the same window and therefore would contain overlapping residues and thus overlapping chemical environments. This is entirely a consequence of the negative sample selection method. If an alternative method were used that did not involve overlapping residues, one hot encoding the chemical environment classifications could serve as additional features. With regards to the interatomic distances between residues, initially we had intended to use all backbone atoms for all residues in the window resulting in 1225 distance features. However, the increased computational time, number of features and the previously described overlapping window problem discouraged us from taking this approach. Instead, we opted for interatomic distances for the cysteine residues participating in the disulfide bond. Disappointingly, it was realized rather late that the 5x5 interatomic distance matrix is symmetric and therefore redundant. A future iteration of the model might involve expanding the interatomic distance calculations to X-1 and X+1 residues for a total of 15 x 15 = 225 features of which 105 are non-redundant. Lastly, a more rigorous set of validation samples is required to accurately gauge model performance.




Chemical Environment Window

As described above, the chemical environment class features were not included in the model due to inconsistencies in the disulfide window. An example of chemical environment counts at each position in the window is provided. Whether data captured here is meaningful is difficult to say, though some patterns to emerge. For example, disulfides are frequently encountered in the X-3 and X+3 positions. This suggests that two disulfides may occur in the same window and hence confounded our ability to effectively utilize the "window" approach.

Residue window

Similar to the chemical environment window, we also determined the frequency (counts) of each amino acid at each position in the window.