3730 ISGylation sites were identified with DiGly proteomics (positive set).
From these ID'd proteins, lysine residues which are not modified with ISG15 were extracted as non-ISGylated sites (negative set).
Some non-ISGylated sites are identified as ISGylation sites on other proteins. Those sites were excluded from the negative set.
We end up with 75060 non-ISGylated sites.
To solve the data imbalance issue, non-ISGylated sites were clustered into 15 clusters with GibbsCluster2.0.
Undersample sites from each cluster based on their size. So the final negative set contains 3730 sites.
AAIndex1 database were downloaded from (https://www.genome.jp/aaindex/)
Database was parsed in Python and 76 relevant indices were selected. Some of them include 'Hydrophobicity index ', 'Residue volume ', 'Amphiphilicity index ', etc.
Each index of each amino acid around the site becomes a feature of the site as shown in the following figure. i.e. residue volume of amino acid at position +1, etc.
It results in 1064 features of a site.
SPIDER3 was downloaded from (https://sparks-lab.org/downloads/)
Each protein (where positive and negative set are generated) was run in the SPIDER3 toolbox where secondary structure prediction and backbone torsion angles calculation of every amino acid were done. These properties of each amino acid around the site becomes one feature of that site.
For example, torsion angle 'theta' of amino acid at position -3 is a feature of this site.
It results in 75 features of a site.
Analysis shows that the ligase, Herc5 prefers lysine residues near the N-terminus to those near the C-terminus as shown in the left figure. A feature called 'relative_position' was created for the lysine residue in each site. i.e. if the lysine is in the first 10% amino acids from the N-terminus, then the value is 0.1, etc.
The heatmap on the right shows that chances to modify a lysine residue by Herc5 is affected by the amino acid composition flanking the site. It's striking that positively charged amino acids are significantly depleted at the N-terminal side of the lysine while positively charged residues at some positions are enriched in the analysis. Negatively charged amino acids show the opposite. This matrix was also used as a scoring matrix to score each site.
7by7 sequence were also extracted as features of one site. For example, amino acid at position +3 is one feature of the site.
Herc5 prefers lysine residues near the N-terminus to those near the C-terminus. This result is consistent with the cotranslational model.
The heatmap shows the amino acid preference of each residue within -7 tp +7 sequence flanking the site of modification with ISG15. Enrichment is shown in red and depletion in blue and represents a log(10) p-value for each amino acid.
In the AAIndex1 database, property values of some amino acids are missing in several indices.
When sites do not have all 7 upstream or downstream amino acids, a placeholder 'Z' was assigned for the alignment. But SPIDER3 toolbox cannot predict secondary structure or backbone torsion angles for these placeholders.
Missing values in the AAIndex1 were imputed with the average value of other amino acids
Missing 'theta' angles: 180; missing 'phi' angles: -180, missing psi and tau angles: 0. Missing secondary structure: 'None' since it's a categorical variable.
At this stage, each site has been assigned with 1141 features.
'SelectKBest' function from the 'sklearn.feature_selection' was used to find the features that are significantly different between the positive and negative datasets.
For each feature, chi2 test between positive and negative datasets was performed. 331 of the total 1141 features differ significantly between the two datasets.
Relative difference of each feature between the two datasets were calculated. The relative differences include lots of zeros, which is possibly due to the default chi2 test; the median relative difference is only 21%. Features with relative difference greater than 21% were kept for further analysis which contains 171 features.
How ISGylation is correlated with some of these features are shown in the following figures.
'FAUJ880110 (*)': Number of full nonbonding orbitals with amino acid at position -2, -4 and -1.
'NADH010107 (*)': hydropathy scale based on self-infomation values in the two state model. Score1&2: site scores. 'SUYM030101 (*)': linker propensity index.
'EISD840101 (*)': consensus normalized hydrophobicity scale at -6, -5, -4 ,-3 -2 and -1 positions.
'MAXF760104 (*): normalized frequency of left-handed alpha-helix at +7, +5 and +1 positions.
Feature "relative_position" was normalized with MinMaxScaler from sklearn module.
Other numeric features were standardized with StandardScaler from sklearn module.
Categorical features including secondary structure and amino acid composition were one-hot encoded with the get_dummies() function in the pandas module.
Dimensionality reduction was performed with PCA from the sklearn module.
Positive dataset and negative dataset are NOT separated from each other with two or three principal components (PC).
First 134 PCs contribute to 99% variance of the dataset. Sites with the 134 PCs were exported as the input of machine learning models.
Positive and negative datasets are not separated with PC1 and PC2. (axises annotation: 0 -> PC1, 1 -> PC2)
Positive and negative datasets are still not separated with PC1, PC2 and PC3.
The first 3 PCs only explain for about 11.3% variance of the whole dataset (positive and negative sets).
(only show the first 175 PCs in the figure)
With first 100 PCs, 95% variance of the dataset can be maintained. With the first 134 PCs, we can keep 99% variance of the entire dataset.
Loadings of the first two PCs with 5 highest weights were listed in the following tables.
Dataset before PCA dimensionality reduction were analyzed with t-SNE and UMAP.
t-SNE was done with TNSE() function from the sklearn.manifold module. It was run with the parameters: n_component=2, perplexity = [10, 30, 50, 100], n_iter = [100, 200, 500, 1000].
Positive and negative datasets are not separated with two t-SNE components..
UMAP was done in R with the umap library. It was run with the parameters: n_neighbours=15, metric=euclidean, local_connectivity=1.
Positive and negative datasets are not separated with two UMAP components.
Axises: t-SNE component1 and t-SNE component2.
Labels: salmon red -> positive set, turquoise -> negative set
Axises: V1 -> UMAP component1, V2 -> UMAP component2
Labels: blue -> positive set, orange -> negative set
Dataset were split into training set and test set (8:2) with the train_test_split() function from the sklearn.model_selection module.
Train random forest / RandomForestClassifier(), support vector machine, logistic regression and decision tree with AdaBoost models with the default parameters.
Optimize the hyper-parameters by random search with the RandomizedSearchCV module from sklearn.model_selection module.
Train those models again with the .best_params_.
Assess the result in terms of the accuracy, sensitivity and specificity.
3730 positive sites and 3730 negative sites (with 7by7 seq) were extracted.
Changes of number of features: 1141(numeric) + 15(categorical) ---(feature_selection)---> 331(numeric) + 15(categorical) ---(feature_selection)---> 171(numeric) + 15(categorical) ---(normalization) ---> 171(numeric) + 21*14 (onehot encoded AA composition) + 1*4(onehot encoded secondary structures) ---(PCA)---> 134PCs(features).
Positive and negative data cannot be separated with two or three components of PCA, t-SNE or UMAP.
In spite of the results of PCA, the dataset will be used to train several types of machine learning models.