We computed biological and physicochemical features for each protein sequence in the training datasets: 20 biological and 9154 physicochemical features per sequence. The biological features were annotated using various bioinformatics tools, and the physicochemical features were computed using the ProtR package (Xiao et al., 2015).
List of 20 biological features of proteins that were used in this study. We used well-known bioinformatics tools to evaluate various properties of the protein sequences in the training datasets. We downloaded these tools from their respective websites for local installation and applied them to the bacterial, protozoan, viral, and fungal datasets.
*Represents cut-offs derived from the literature or the default threshold.
Gardy, J.L., Spencer, C., Wang, K., Ester, M., Tusnady, G.E., Simon, I., Hua, S., deFays, K., Lambert, C., Nakai, K. and Brinkman, F.S., 2003. PSORT-B: Improving protein subcellular localization prediction for Gram-negative bacteria. Nucleic acids research, 31(13), pp.3613-3617.
Chaudhuri, R., Ansari, F.A., Raghunandanan, M.V. and Ramachandran, S., 2011. FungalRV: adhesin prediction and immunoinformatics portal for human fungal pathogens. BMC genomics, 12(1), p.192.
Petersen, T.N., Brunak, S., Von Heijne, G. and Nielsen, H., 2011. SignalP 4.0: discriminating signal peptides from transmembrane regions. Nature methods, 8(10), pp.785-786.
Nielsen, M., Lundegaard, C., Lund, O. and Keşmir, C., 2005. The role of the proteasome in generating cytotoxic T-cell epitopes: insights obtained from improved predictions of proteasomal cleavage. Immunogenetics, 57(1-2), pp.33-41.
Hofmann, K.A.W.S., 1993. TMbase-A database of membrane spanning proteins segments. Biol. Chem. Hoppe-Seyler, 374, p.166.
Gasteiger, E., Hoogland, C., Gattiker, A., Wilkins, M.R., Appel, R.D. and Bairoch, A., 2005. Protein identification and analysis tools on the ExPASy server. In The proteomics protocols handbook (pp. 571-607). Humana press.
Larsen, M.V., Lundegaard, C., Lamberth, K., Buus, S., Lund, O. and Nielsen, M., 2007. Large-scale validation of methods for cytotoxic T-lymphocyte epitope prediction. BMC bioinformatics, 8(1), p.424.
Andreatta, M. and Nielsen, M., 2016. Gapped sequence alignment using artificial neural networks: application to the MHC class I system. Bioinformatics, 32(4), pp.511-517.
Emanuelsson, O., Nielsen, H., Brunak, S. and Von Heijne, G., 2000. Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. Journal of molecular biology, 300(4), pp.1005-1016.
Altschul, S.F., Gish, W., Miller, W., Myers, E.W. and Lipman, D.J., 1990. Basic local alignment search tool. Journal of molecular biology, 215(3), pp.403-410.
Emanuelsson, O., Nielsen, H. and Heijne, G.V., 1999. ChloroP, a neural network‐based method for predicting chloroplast transit peptides and their cleavage sites. Protein Science, 8(5), pp.978-984.
Xiao, N., Cao, D.S., Zhu, M.F. and Xu, Q.S., 2015. protr/ProtrWeb: R package and web server for generating various numerical representation schemes of protein sequences. Bioinformatics, 31(11), pp.1857-1859.
Results of the bioinformatics tools used to evaluate the 20 biological features for the 574 bacterial proteins.
List of 11 biological features selected after filtering for the bacterial dataset. The 20 biological features initially considered were filtered using Welch’s t-test: a p-value was computed for each feature, and only the 11 features with a p-value of less than 0.05 were retained for the subsequent parts of our study.
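The filtering step can be illustrated with a minimal, standard-library sketch of Welch’s t-test followed by the p < 0.05 cut-off. It is not the code used in this study. The source does not state which two groups of proteins were compared, so the two value lists per feature, the feature names `feature_1` and `feature_2`, and the function names are all hypothetical.

```python
import math
from statistics import mean, variance


def _betacf(a, b, x, max_iter=300, eps=1e-12):
    """Continued fraction for the incomplete beta function (modified Lentz)."""
    tiny = 1e-300
    c = 1.0
    d = 1.0 - (a + b) * x / (a + 1.0)
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        # Each iteration applies the even term, then the odd term.
        even = m * (b - m) * x / ((a + 2 * m - 1.0) * (a + 2 * m))
        odd = -(a + m) * (a + b + m) * x / ((a + 2 * m) * (a + 2 * m + 1.0))
        for term in (even, odd):
            d = 1.0 + term * d
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = 1.0 + term / c
            c = c if abs(c) > tiny else tiny
            delta = d * c
            h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def _betainc(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    front = math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                     + a * math.log(x) + b * math.log1p(-x))
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _betacf(a, b, x) / a
    return 1.0 - front * _betacf(b, a, 1.0 - x) / b


def t_two_sided_p(t, df):
    """Two-sided p-value of Student's t distribution with df degrees of freedom."""
    return _betainc(df / 2.0, 0.5, df / (df + t * t))


def welch_t_test(x, y):
    """Return (t, df, p) for Welch's unequal-variance t-test."""
    vx, vy = variance(x) / len(x), variance(y) / len(y)
    if vx + vy == 0.0:  # both groups constant: the statistic is undefined
        return float("nan"), float("nan"), float("nan")
    t = (mean(x) - mean(y)) / math.sqrt(vx + vy)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, df, t_two_sided_p(t, df)


def select_features(features, alpha=0.05):
    """Keep the names of features whose Welch p-value is below alpha."""
    return [name for name, (a, b) in features.items()
            if welch_t_test(a, b)[2] < alpha]


# Hypothetical toy data: feature name -> (values in group A, values in group B).
toy = {
    "feature_1": ([5.1, 4.9, 5.3, 5.0, 5.2], [1.0, 1.2, 0.9, 1.1, 1.3]),
    "feature_2": ([1.0, 2.0, 3.0, 4.0, 5.0], [1.0, 2.0, 3.0, 4.0, 5.0]),
}
selected = select_features(toy)
print(selected)
```

In practice, `scipy.stats.ttest_ind(a, b, equal_var=False)` performs the same unequal-variance test and would replace the hand-written helpers above.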
Results of the ProtR package used to calculate the values of 9154 physicochemical features for the bacterial dataset. The physicochemical properties were computed for the 574 protein sequences in the training dataset using the various programs available in ProtR.
Results of ProtR used to calculate the values of the 1436 shortlisted physicochemical features (properties) for the bacterial dataset. Of the 9154 physicochemical properties, 1436 were significant (p < 0.05; Welch’s t-test).
We then combined the 11 biological features and the 1436 physicochemical properties (1447 features in total) for subsequent analysis.
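As a rough sketch of this combination step, the two filtered feature sets can be joined per protein. The dictionary layout, the protein ID `protein_1`, the feature names, and the function name `combine_features` are hypothetical and not taken from the study.

```python
def combine_features(biological, physicochemical):
    """Join two per-protein feature tables (dict of dicts) on protein ID."""
    combined = {}
    # Only proteins present in both tables can receive a full feature vector.
    for pid in sorted(biological.keys() & physicochemical.keys()):
        overlap = biological[pid].keys() & physicochemical[pid].keys()
        if overlap:
            raise ValueError(f"duplicate feature names for {pid}: {sorted(overlap)}")
        combined[pid] = {**biological[pid], **physicochemical[pid]}
    return combined


# Hypothetical placeholder values with the feature counts reported for the
# bacterial dataset: 11 biological and 1436 physicochemical features.
bio = {"protein_1": {f"bio_{i}": 0.0 for i in range(11)}}
phys = {"protein_1": {f"phys_{i}": 0.0 for i in range(1436)}}
merged = combine_features(bio, phys)
print(len(merged["protein_1"]))
```

Raising an error on duplicate feature names keeps one feature set from silently overwriting the other.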
The biological and physicochemical features were also computed for the protozoan, viral, and fungal datasets, with the following results:
Results of RV and ProtR (combined) used to calculate the values of the 2074 shortlisted features (properties) for the protozoan dataset.
Results of RV and ProtR used to calculate the values of the 1754 shortlisted features (properties) for the viral dataset.
Results of RV and ProtR used to calculate the values of the 2801 shortlisted features (properties) for the fungal dataset.