geneRFinder: gene finding in distinct metagenomic data complexities
In: (work in progress)
Author: Silva, R.; Padovani, K.; Goes, F. R.; Alves, R.
Abstract: We provide a novel, comprehensive benchmark dataset for gene prediction, which is based on the Critical Assessment of Metagenome Interpretation (CAMI) challenge and contains labeled data from gene regions. We also introduce geneRFinder, an ML-based gene predictor that outperforms state-of-the-art gene prediction tools across this benchmark. On high-complexity metagenomes, the average prediction rates of geneRFinder differed in percentage terms by 54% and 64% from those of Prodigal and FragGeneScan, respectively. The specificity of geneRFinder showed the largest gap against FragGeneScan, 79 percentage points, and exceeded that of Prodigal by 66 percentage points. According to McNemar's test, all percentage differences between the predictors' performances are statistically significant for all datasets at the 99% confidence level.
Predicting Genes in Energetic Metabolisms
In: (work in progress)
Author: Silva, R.; Padovani, K.; Alves, R.
Opening the Black Box: Understanding Features For Gene Prediction
In: (work in progress)
Author: Silva, R.; Alves, R.
A Random Forest Classifier For Prokaryotes Gene Prediction
In: 8th Brazilian Conference On Intelligent Systems (Bracis), 2019, Salvador, Ba. Bracis Proceedings, 2019.
Author: Silva, R.; Padovani, K.; Goes, F. R.; Alves, R.
Abstract: Metagenomics is related to the study of microbial genomes, known as metagenomes, describing them through their microorganism compositions, relationships and activities, thus allowing a greater knowledge about the fundamentals of life and the broad microbial diversity. One way to accomplish this task is by analyzing information from genes contained in metagenomes. The process of identifying genes in DNA sequences is usually called gene prediction. This work presents a new gene predictor using the Random Forest classifier. The proposed model obtains better classification results than state-of-the-art gene prediction tools widely used by the bioinformatics community. Random Forest presented more robust results, being 27% better than Prodigal and 20% better than FragGeneScan with respect to AUC values on the independent test set. Feature engineering has been revisited in the gene prediction problem, reinforcing the importance of careful evaluation when assembling a good feature set. K-mer counting features can be seen as the fundamental building blocks for developing robust gene predictors.
Link: drive.google.com/open?id=1gdOOvyXPMmM7uAv5W2PwPKq1VHHyCMSy
Sequence Binning Prior to Metagenome Assembly: A Case Study
In: Brazilian Symposium On Bioinformatics, 2018, Rio De Janeiro.
Author: Oliveira, P.; Padovani, K.; Silva, R.; Alves, R.
Abstract: This empirical study aimed to answer the following question: does binning over reads contribute to the production of better assemblies? We evaluated whether quantitative (genome binning) and qualitative (taxonomic binning) approaches benefit the assembly of genomes from metagenome data, using statistics that assess assemblies by their sizes and qualities.
Link: drive.google.com/open?id=1cc5TKjFgApR1peklljI85EZOcRl6ANox
Training Set Composition Analysis for Machine Learning Evaluation Applied to Gene Prediction
In: Brazilian Symposium On Bioinformatics, 2018, Rio De Janeiro.
Author: Silva, R.; Padovani, K.; Santos, W.; Xavier, R.; Alves, R.
Abstract: Metagenomics allows the study of microbial communities, known as metagenomes, describing them through their compositions and the relationships and activities of the microorganisms that coexist there, thus allowing a deeper knowledge about the fundamentals of life and about the broad microbiological diversity, which is still poorly known. Such a description can be achieved by analyzing information from genes contained in (meta)genomes, extracted through the process of identifying genes in DNA sequences, called gene prediction. This work presents a study that analyzes the impact of training set composition when using machine learning for protein-coding gene prediction.
Link: drive.google.com/open?id=1BIMlLADvHAjFaYSaYytqmi6uZd9I2Odh