chimeres

On the evaluation of the fidelity of supervised classifiers in the prediction of chimeric RNAs

1Sacha Beaumeunier, 1Jerome Audoux, 1Anthony Boureux, 1,2Therese Commes, 1Florence Fruffle, 1Nicolas Philippe, 2,3,4Ronnie Alves∗

1Institut de Medecine Regeneratrice et de Biotherapie, INSERM U1183, CHU Montpellier, Montpellier, Fr.

2Institut de Biologie Computationnelle, Universite Montpellier, Montpellier, Fr.

3Laboratoire d’Informatique, de Robotique et de Microelectronique de Montpellier, Universite Montpellier, UMR 5506 CNRS, Montpellier, Fr.

4PPGCC, Universidade Federal do Pará, Belém, Br.

∗Corresponding author: alvesrco @ gmail[dot]com

Background: High-throughput sequencing technology and bioinformatics have identified chimeric RNAs (chRNAs), raising the possibility that chRNAs expressed specifically in disease can be used as biomarkers for both diagnosis and prognosis.

Results: The task of discriminating true chRNAs from false ones poses an interesting Machine Learning (ML) challenge. First, the sequencing data may contain false reads due to technical artifacts, and during the analysis process bioinformatics tools may generate false positives due to methodological biases. Moreover, even if we succeed in obtaining a proper set of observations (enough sequencing data) about true chRNAs, chances are that the resulting model will not generalize beyond it. As in any other machine learning problem, the first big issue is finding good data to build models. As far as we know, there is no common benchmark data set available for chRNA detection. A classification baseline is also lacking in the related literature. In this work we move towards benchmark data and an evaluation of the fidelity of supervised classifiers in the prediction of chRNAs.

Conclusions: We have developed a benchmark pipeline incorporating a genome mutation process and RNA-seq data simulated by Flux Simulator. The sequencing reads, generated at distinct depths, were aligned and annotated by CRAC. CRAC offers a new way to analyze RNA-seq data by integrating genomic location and local coverage, allowing biological predictions in one step. Additionally, these reads were functionally annotated and aggregated to form chRNA events, making it possible to evaluate classifier performance at both the read and event levels. The resulting data were used as a benchmark for several comparative analyses. Ensemble learning strategies proved more robust on this classification problem, providing an average AUC of 95% (ACC=94%, Kappa=0.87). The resulting classification models were successfully applied to real RNA-seq data from a set of twenty-seven patients with acute myeloid leukemia (AML).

Datasets

We have developed a benchmark pipeline (chimeres.pdf) using the above genome simulation procedure along with i) the generation of corresponding RNA-seq reads by Flux Simulator and ii) the labelling of chRNA classes with the CRAC tool. A total of five mutated human (GRCh38) genomes were generated. They were used in a progressive sampling scheme for the machine learning comparative analyses. For each mutated genome, 40 million paired-end reads with a read length of 100 bp were generated. The first run (r1) uses one genome, and one more genome is added at each subsequent run until all five genomes are included in the fifth run (r5).
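The progressive sampling scheme can be written out as a short sketch. The genome labels (`genome_1` to `genome_5`) are hypothetical placeholders, and the read totals assume that all reads of every genome in a run are pooled, which the text does not state explicitly.

```python
# Minimal sketch of the progressive sampling schedule (r1 to r5).
# Genome names are hypothetical; only the counts come from the text.
READS_PER_GENOME = 40_000_000  # 40 million paired-end reads per mutated genome
N_GENOMES = 5                  # five mutated GRCh38 genomes


def progressive_runs(n_genomes):
    """Return {run name: list of genomes}, adding one genome per run."""
    return {
        f"r{k}": [f"genome_{i}" for i in range(1, k + 1)]
        for k in range(1, n_genomes + 1)
    }


runs = progressive_runs(N_GENOMES)
for name, genomes in runs.items():
    # Assumption: reads from all genomes in a run are pooled together.
    total_reads = len(genomes) * READS_PER_GENOME
    print(name, len(genomes), "genome(s),", total_reads, "reads")
```

Under the pooling assumption, the amount of simulated data grows linearly from r1 to r5, which is what lets the comparison show the effect of adding observations.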

Real data

  • RNA-seq data (Illumina HiSeq 2000, Homo sapiens) from a set of twenty-seven patients with acute myeloid leukemia (AML). The goal of the original study is to obtain a comprehensive view of mutations and gene expression in human AML. All sequence data are freely available from the GEO database under accession number GSE49642.

Simulation data at read level

Simulation data at event level

  • Six independent data sets (e_chRNAs.zip) (3 RNA-seq 100 bp + 3 RNA-seq 200 bp), corresponding to 1634 chimeric events (1257 False and 377 True)
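Because the event-level classes are imbalanced (1257 False versus 377 True), accuracy alone can look good for a classifier that has learned nothing. The sketch below computes accuracy and Cohen's kappa from a 2×2 confusion matrix and applies them to a hypothetical majority-class predictor that labels every event as False. The predictor is an illustration only, not one of the models evaluated in this work.

```python
# Minimal sketch: accuracy and Cohen's kappa from a binary confusion matrix.
def accuracy_and_kappa(tp, fp, fn, tn):
    """Return (accuracy, Cohen's kappa) for a 2x2 confusion matrix."""
    n = tp + fp + fn + tn
    observed = (tp + tn) / n  # observed agreement, i.e. accuracy
    # Agreement expected by chance, from the row and column totals.
    expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / (n * n)
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


N_FALSE, N_TRUE = 1257, 377  # event-level class counts from the text

# Hypothetical predictor: call every event False (the majority class).
acc, kappa = accuracy_and_kappa(tp=0, fp=0, fn=N_TRUE, tn=N_FALSE)
print(f"accuracy={acc:.3f} kappa={kappa:.3f}")
```

The majority-class predictor reaches about 76.9% accuracy (1257/1634) with a kappa of 0, which is why kappa is a useful companion to accuracy on these data.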

Classification baseline

We adopted a progressive sampling strategy to better understand how adding more observations affects classifier performance. We started with one simulated data set and then added more data sequentially until all five genomes were included. For each run, we set aside one third of the data for the test phase, while the remaining two thirds were used for model building. Training was performed with a 10-fold cross-validation scheme along with a tuning grid for each ML technique. Additionally, because the training/test split was random, we repeated each run three times to check model stability, giving a total of 15 performance points (5 runs × 3 repetitions). Model performance was averaged for each run (r1 to r5).
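The evaluation protocol can be sketched in a few lines of standard-library Python. Everything below the protocol itself is an assumption made for illustration: the data are synthetic (`make_toy_genome`), the "classifier" is a toy threshold rule, and the tuning grid holds three arbitrary thresholds. None of this reproduces the actual features, models, or grids used in this work.

```python
import random
from statistics import mean


def make_toy_genome(rng, n=300):
    """Hypothetical stand-in for one simulated genome: (score, is_true) rows."""
    rows = []
    for _ in range(n):
        label = rng.random() < 0.25
        rows.append((rng.gauss(1.0 if label else 0.0, 0.5), label))
    return rows


def split_train_test(rows, test_fraction, rng):
    """Shuffle a copy of rows and hold out test_fraction of them."""
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def k_folds(rows, k):
    """Yield (train, validation) pairs for k-fold cross-validation."""
    for i in range(k):
        yield [r for j, r in enumerate(rows) if j % k != i], rows[i::k]


def accuracy(threshold, rows):
    """Accuracy of the toy rule 'score >= threshold means True'."""
    return sum((score >= threshold) == label for score, label in rows) / len(rows)


def cv_score(threshold, train, k):
    # The toy rule has nothing to fit, so only the validation folds are scored.
    return mean(accuracy(threshold, val) for _, val in k_folds(train, k))


def evaluate(n_genomes=5, repeats=3, k=10, grid=(0.25, 0.5, 0.75), seed=0):
    rng = random.Random(seed)
    genomes = [make_toy_genome(rng) for _ in range(n_genomes)]
    points = []  # one (run, repeat, test accuracy) tuple per performance point
    for run in range(1, n_genomes + 1):
        pooled = [row for g in genomes[:run] for row in g]  # progressive sampling
        for repeat in range(1, repeats + 1):
            train, test = split_train_test(pooled, 1 / 3, rng)
            best = max(grid, key=lambda t: cv_score(t, train, k))  # tuning grid
            points.append((run, repeat, accuracy(best, test)))
    return points


points = evaluate()
per_run = {run: mean(a for r, _, a in points if r == run) for run in range(1, 6)}
print(len(points), "performance points;", per_run)
```

The held-out third is never seen during grid selection, so the per-run averages estimate performance on unseen data rather than on the folds used for tuning.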

At read level

BioData-Mining 2016

We proposed a modeling strategy, based on a simulated data generator, that can be used to improve tool performance in chRNA classification and that makes it possible to continuously integrate new complex chimeric events. The pipeline incorporates a genome mutation process and simulated RNA-seq data. Reads generated at distinct depths were aligned and analysed by CRAC, which integrates genomic location and local coverage, allowing biological predictions at the read scale.

BEAUMEUNIER, S.; AUDOUX, J.; BOUREUX, A.; COMMES, T.; FRUFFLE, F.; ALVES, R. Use of simulated data sets to evaluate the fidelity of chimeric RNA classifiers.

BIOKDD-DEXA 2015

A short introduction to this project (The Role of Machine Learning in Finding Chimeric RNAs) was presented (slides.pdf) at the 6th International Workshop on Biological Knowledge Discovery and Data Mining (BIOKDD'15), held in parallel with the 26th International Conference on Database and Expert Systems Applications (DEXA’15). Best Paper Award at BIOKDD-DEXA'15.

DOI: http://dx.doi.org/10.1109/DEXA.2015.25

JOBIM 2015

A poster was presented at the French Bioinformatics Conference (JOBIM'15): Rôle de l'apprentissage automatique dans le problème de détection d'ARN chimères (The role of machine learning in the problem of detecting chimeric RNAs). (Poster_jobim.pdf).

The benchmark pipeline.