To mine data regarding disulfide geometry, we gathered a list of high-resolution X-ray diffraction PDB structures with the keyword “disulfide”. The following query was run against the PDB database:
QUERY:
Full Text = "disulfide" AND (Experimental Method = "X-RAY DIFFRACTION" AND Polymer Entity Type = "Protein" AND Refinement Resolution = [ 2 - 2.5 ] AND ( Release Date = [ 01/01/2005 - 12/31/2009 ] OR Release Date = [ 01/01/2010 - 12/31/2014 ] OR Release Date = [ 01/01/2015 - 12/31/2019 ] OR Release Date = [ 01/01/2020 - 12/31/2024 ] ) )
The following Python packages were used in our analysis:
PyRosetta
Numpy
Pandas
Scipy
Sklearn
Seaborn
A PDB file is a text file containing the 3D structure of a protein and provides atomic coordinates. PDB structures are read into PyRosetta for interrogation of protein geometry. While PyRosetta provides an interface to the Rosetta molecular modeling suite for structural prediction, manipulation and energetic calculations, we use it solely to mine the local protein geometry surrounding disulfide bonds.
Figure 2. Schematic of the pipeline for creating the model described herein. Data is mined from the protein data banked, then characterized by a number of structural feature prior to being inserted into out disulfide formation algorithm.
The feature generation algorithm performs the following:
For each PDB ID in the query list, the corresponding PDB file is downloaded. A pose object is created using PyRosetta. This object contains the chain, residue and atomic coordinates of interest. For each residue in the protein sequence, if the residue is cysteine, the index of the residue is stored in a dictionary of cysteine IDs (cys_ids). Pairs of cysteines are found by determining whether each combination of cysteines is within 3 angstroms. If the cysteines are within 3 angstroms, then they are said to be forming a disulfide bond and the residue indices for these cysteines are stored as a pair (ds_unique_pairs). Likewise, a dictionary of disulfide pairs is created where one residue serves as the key and its disulfide bonding partner is the value. To determine whether the local geometry and chemistry of residues flanking the disulfide bond may influence their formation, a window of six residues, three on each side of each cysteine, is extracted (Xi-3, Xi-2, Xi-1, CYS, Xi+1, Xi+2, Xi+3).
The residue ID at each position is matched with a classification of polar, hydrophobic, charged, or disulfide to gain a statistical view of the frequency of each amino acid and chemistry at each position. Initially, for feature selection, we had aimed to use the pairwise interatomic distances between all backbone atoms (Ca, Cb, C, O, N) in pair of windows (where each window is of length 7) as features leading to 25x49 = 1225 interatomic features. Because glycine does not have a Cb atom, it was mutated to alanine if found in the window. Unexpectedly, many disulfides occurred proximal to the N or C-terminal ends and therefore window size mismatches limited this calculation, and it was abandoned for a simple pairwise atomic distance model for each cysteine alone. Thus, there are 25 pairwise interatomic features for each pair of cysteines forming a disulfide bond. Dihedral angles, phi and psi for each disulfide were also added as features. Thus, each disulfide bond is represented by 27 features. These consist of our “positive” sample population (class labels = 1 for disulfide).
The question of how to logically construct a representative “negative” sample population proved to be much more difficult in practice than in theory. A random selection of residues with Cb:Cb distances greater than some threshold was avoided because it’s likely that the features calculated would be practically insignificant from an engineering perspective. In practice, residues are selected a priori based on their proximity to one another. Therefore, we reasoned it would be useful to select a “negative” population of residue pairs that existed in close proximity without participating in a disulfide bond. However, selecting this population proved challenging because it would involve calculating pairwise distances for all combinations of residues for all 3000 proteins. Ultimately, we deferred to a literature search on similar computational approaches and found an example where residues with the shortest Ca:Ca distance in a window of: [Xi-1, CYS, Xi+1 ], flanking each cysteine participating in a disulfide bond were selected as the “negative” population. We implemented the same approach, but with a window size of 7 as described above. The positive and negative populations were calculated and stored in separated Pandas dataframes before final concatenation. Classification labels relating to the cysteine residue indices, PDBID were also stored in the dataframe. A total of 3000 PDB structures were extracted resulting in 3088 samples with 1549 unique disulfide bonds and 1539 non-disulfide bonds.