Base model (Random Forest):
With only 70.28% accuracy.
Some hyper-parameters:
n_estimators: 1000
max_features: 'auto'
max_depth: None
min_samples_split: 2
min_samples_leaf: 1
bootstrap: True
These hyper-parameters were optimized with random search, yielding n_estimators: 1900, max_features: 'sqrt', max_depth: 240, min_samples_split: 2, min_samples_leaf: 6, bootstrap: True.
With the optimized hyper-parameters, accuracy increases only slightly, from 70.28% to 71.02%; sensitivity and specificity also improve slightly.
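The tuning described above can be sketched with scikit-learn. This is a minimal illustration, not the project's actual pipeline: the real ISGylation feature matrix is replaced by a synthetic stand-in, and the search grid is shrunk for speed (the report searched up to n_estimators=1900).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for the ISGylation feature matrix (real data not shown here).
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Base model with the hyper-parameters listed above; 'auto' was the old
# scikit-learn default for max_features, equivalent to 'sqrt' for classifiers.
base = RandomForestClassifier(n_estimators=1000, max_features="sqrt",
                              max_depth=None, min_samples_split=2,
                              min_samples_leaf=1, bootstrap=True,
                              random_state=0)
base.fit(X_tr, y_tr)
acc_base = accuracy_score(y_te, base.predict(X_te))

# Random search over a grid similar in spirit to the one used in the report
# (candidate values here are illustrative and reduced to keep the sketch fast).
param_dist = {
    "n_estimators": [100, 200, 400],
    "max_features": ["sqrt", "log2"],
    "max_depth": [None, 60, 240],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 3, 6],
    "bootstrap": [True, False],
}
search = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                            param_dist, n_iter=8, cv=3, random_state=0)
search.fit(X_tr, y_tr)
acc_tuned = accuracy_score(y_te, search.best_estimator_.predict(X_te))
print(acc_base, acc_tuned, search.best_params_)
```

On the real data, such a search produced the modest gain reported above; on this synthetic stand-in the numbers carry no biological meaning.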
Base model (Support Vector Machine):
With only 71.35% accuracy.
Some hyper-parameters:
C: 1.0
kernel: 'rbf'
probability: False
These hyper-parameters were optimized with random search, yielding C: 10, gamma: 0.001, kernel: 'rbf'.
With the optimized hyper-parameters, accuracy increases only slightly, from 71.35% to 72.02%.
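A corresponding sketch for the SVM, again with a synthetic stand-in dataset. Reading the tuned values {10, 0.001, 'rbf'} as C, gamma and kernel is an assumption here, since gamma is the natural RBF parameter taking values like 0.001.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the ISGylation feature matrix.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Base model: scikit-learn defaults (C=1.0, kernel='rbf', probability=False).
base = SVC(C=1.0, kernel="rbf", probability=False)
base.fit(X_tr, y_tr)
acc_base = accuracy_score(y_te, base.predict(X_te))

# Random search over C, gamma and kernel; the candidate lists are illustrative
# but include the tuned values reported above (C=10, gamma=0.001, kernel='rbf').
param_dist = {"C": [0.1, 1, 10, 100],
              "gamma": [1e-3, 1e-2, "scale"],
              "kernel": ["rbf", "linear"]}
search = RandomizedSearchCV(SVC(), param_dist, n_iter=8, cv=3, random_state=0)
search.fit(X_tr, y_tr)
print(acc_base, search.best_params_)
```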
Base model (Decision Tree with AdaBoost):
With only 66.47% accuracy.
Some hyper-parameters:
n_estimators: 200
learning_rate: 1
base estimator max depth: 1
base estimator min samples leaf: 1
These hyper-parameters were optimized with random search, yielding n_estimators: 1000, learning_rate: 1, base estimator max depth: 9, base estimator min samples leaf: 15.
With the optimized hyper-parameters, accuracy increases only marginally, from 66.47% to 66.53%.
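The boosted-tree setup can be sketched as follows, on a synthetic stand-in dataset. The helper handles scikit-learn's rename of the base-estimator keyword across versions; the base/tuned parameter values are the ones listed above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the ISGylation feature matrix.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

def boosted_tree(max_depth, min_samples_leaf, n_estimators, learning_rate):
    """AdaBoost over a shallow decision tree, as described in the report."""
    tree = DecisionTreeClassifier(max_depth=max_depth,
                                  min_samples_leaf=min_samples_leaf)
    try:  # scikit-learn >= 1.2 uses the 'estimator' keyword
        return AdaBoostClassifier(estimator=tree, n_estimators=n_estimators,
                                  learning_rate=learning_rate, random_state=0)
    except TypeError:  # older releases use 'base_estimator'
        return AdaBoostClassifier(base_estimator=tree, n_estimators=n_estimators,
                                  learning_rate=learning_rate, random_state=0)

# Base configuration (depth-1 stumps) vs. the tuned one from the random search.
base = boosted_tree(1, 1, 200, 1.0).fit(X_tr, y_tr)
tuned = boosted_tree(9, 15, 1000, 1.0).fit(X_tr, y_tr)
acc_base = accuracy_score(y_te, base.predict(X_te))
acc_tuned = accuracy_score(y_te, tuned.predict(X_te))
print(acc_base, acc_tuned)
```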
Base model (Logistic Regression):
With only 71.82% accuracy.
Some hyper-parameters:
C: 1.0
max_iter: 500
penalty: 'l2'
solver: 'lbfgs'
These hyper-parameters were optimized with random search, yielding C: 0.1, max_iter: 500, penalty: 'l1', solver: 'liblinear'.
With the optimized hyper-parameters, accuracy increases only slightly, from 71.82% to 72.49%. Sensitivity has increased slightly at the cost of a small loss in specificity.
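A sketch of the two logistic-regression configurations on a synthetic stand-in dataset. Note that an L1 penalty requires a solver that supports it, which is why the tuned configuration pairs it with liblinear.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the ISGylation feature matrix.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Base model: L2 penalty with the lbfgs solver, as listed above.
base = LogisticRegression(C=1.0, max_iter=500, penalty="l2", solver="lbfgs")
base.fit(X_tr, y_tr)
acc_base = accuracy_score(y_te, base.predict(X_te))

# Tuned model from the random search: stronger regularization (C=0.1) with an
# L1 penalty, which lbfgs does not support, so liblinear is used instead.
tuned = LogisticRegression(C=0.1, max_iter=500, penalty="l1",
                           solver="liblinear")
tuned.fit(X_tr, y_tr)
acc_tuned = accuracy_score(y_te, tuned.predict(X_te))
print(acc_base, acc_tuned)
```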
Base model (K-Nearest Neighbors):
With only 62.12% accuracy.
Some hyper-parameters:
n_neighbors: 5
leaf_size: 30
weights: 'uniform'
p: 2
These hyper-parameters were optimized with random search, yielding n_neighbors: 46, leaf_size: 20, weights: 'distance', p: 2.
With the optimized hyper-parameters, accuracy improves substantially, from 62.12% to 67.34%.
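The KNN comparison can be sketched the same way, on a synthetic stand-in dataset; the tuned configuration uses many more neighbours with distance-weighted voting.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the ISGylation feature matrix.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Base model with scikit-learn defaults, then the tuned configuration from
# the random search (n_neighbors=46, distance-weighted voting).
base = KNeighborsClassifier(n_neighbors=5, leaf_size=30,
                            weights="uniform", p=2)
tuned = KNeighborsClassifier(n_neighbors=46, leaf_size=20,
                             weights="distance", p=2)
base.fit(X_tr, y_tr)
tuned.fit(X_tr, y_tr)
acc_base = accuracy_score(y_te, base.predict(X_te))
acc_tuned = accuracy_score(y_te, tuned.predict(X_te))
print(acc_base, acc_tuned)
```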
Random Forest (left: Precision-Recall, right: ROC)
Support Vector Machine (left: Precision-Recall, right: ROC)
Decision Tree with AdaBoost (left: Precision-Recall, right: ROC)
Logistic Regression (left: Precision-Recall, right: ROC)
K-Nearest Neighbors (left: Precision-Recall, right: ROC)
Interferon (IFN)-stimulated cells have been used to study ISGylation targets. In a previous study, 190 sites were identified in IFN-treated porcine cells using ISG15-IP and standard bottom-up proteomics. In another study, 614 ISGylation sites were identified in Listeria-infected mouse liver and spleen using DiGly proteomics. Transfection of HEK293T cells with the four components of the ISG15 conjugation machinery (Ube1L, UbcH8, hHerc5 and ISG15) was shown to yield more robust ISGylation. From such transfected cells, we collected the largest dataset reported so far, with 3,733 ISGylation sites identified.
Unlike ubiquitination, which involves about 600 ligases, ISGylation is catalyzed by a single major ligase, Herc5 in humans. However, we did not identify a sequence motif among ISGylation targets. Similarly, no strong enrichment for any specific motif was found in a ubiquitination dataset in which about 19,000 ubiquitination sites were identified by DiGly proteomics. In contrast, a conserved sequence motif (Ψ-K-x-D/E) has been discovered for SUMOylation, a post-translational protein modification with another ubiquitin-like modifier (SUMO). Like ISGylation, the substrate specificity of SUMOylation relies on a single E2, the SUMO-conjugating enzyme Ubc9. Nevertheless, statistical tests on the ISGylation dataset show that hundreds of properties are distributed significantly differently between ISGylation targets and non-targeted lysine residues. Using these features, five machine learning models were trained in this project to predict ISGylation targets among lysine residues.
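The kind of per-feature comparison just described can be sketched with a two-sample test. This is a minimal illustration only: the Mann-Whitney U test and the synthetic feature values below are assumptions, not the project's actual test or data; they mimic the reported situation of heavily overlapping distributions that still differ significantly in location.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
# Hypothetical stand-ins for one feature (e.g. local hydrophobicity) around
# ISGylation-target lysines vs. non-targeted lysines: heavily overlapping
# distributions with a small shift in location.
targets = rng.normal(loc=0.3, scale=1.0, size=500)
non_targets = rng.normal(loc=0.0, scale=1.0, size=500)

# Non-parametric two-sample test: are the two distributions shifted?
stat, p = mannwhitneyu(targets, non_targets, alternative="two-sided")
print(p)
```

A small p-value here indicates a statistically significant shift even though the distributions overlap, which is exactly why such features are individually weak but collectively usable by a classifier.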
With the five optimized models, we achieved accuracies between 67% and 72%. By comparison, machine learning prediction tools for phosphorylation have reached 88-92% accuracy, and prediction tools for ubiquitination exceed 80% accuracy.
Logistic regression showed the highest accuracy; random forest and support vector machine were only slightly lower. The decision tree with AdaBoost and the KNN model achieved accuracies below 70%, and also showed lower sensitivity and specificity than the other models.
While RF, SVM and LR reached very similar accuracy levels, they differed in sensitivity and specificity. Random forest was best at correctly identifying as many ISGylation targets as possible from the positive set, whereas the support vector machine and logistic regression models were relatively better at recognizing non-ISGylation sites in the negative set.
Even though the predictive power of these models is limited relative to tools predicting phosphorylation and ubiquitination, their current performance already indicates that ISGylation of nascent polypeptides by human Herc5 is not a stochastic process. Understanding how these features shape the parameters of the prediction models may help elucidate how Herc5 functions.
The limited performance may reflect the fact that ISGylation targets and non-targets simply do not differ enough to be cleanly separated. Another possibility is that the features best able to distinguish the positive and negative datasets have not yet been discovered in our current analysis. Although a few features, including the site scores, hydrophobicity and the frequency of forming an alpha-helix, were found to differ significantly between the two datasets, their distributions in the positive and negative datasets overlap substantially.
Deep learning models may be an alternative, and possibly better, option for building ISGylation prediction models. Different deep learning strategies have been used to predict E3-specific lysine ubiquitination sites and phosphoglycerylated lysine residues, with good performance. In future work, with better features and deep-learning-based techniques, it may be possible to build a model with good predictive power for estimating the probability that a lysine residue is an ISGylation target.