Ivo Gut, Centre Nacional d'Anàlisi Genòmica, Barcelona

Is somatic mutation calling a solved problem? – Experience of the ICGC somatic mutation calling benchmark

Abstract: Somatic mutations in cancer genomes are determined by comparing whole-genome or exome DNA sequencing of a tumour with a normal sample. Mutations identified from massively-parallel sequencing experiments are usually verified by an orthogonal technology. Early on it became apparent that verification rates were lower than would be desirable. Based on this, the ICGC project initiated a series of benchmarks in which the practices and outputs of different centres are compared. The benchmarks cover whole-genome sequencing and sequence data analysis. The insights are that there are huge discrepancies in how sequencing is performed by different laboratories, even though they use the same basic instrumentation, and that pipelines developed by different groups give very different results, even when they incorporate the same pieces of software.

Ana Conesa, Institute of Computational Genomics, Valencia and  University of Florida

Integration of multi-omics data to study B-cell differentiation: experimental design and analysis issues

Abstract: Next generation sequencing has sped up genome analysis and brought omics research closer to many organisms and biological scenarios. Today, an increasing number of research projects propose the combined use of different omics platforms to investigate diverse aspects of genome functioning. These proposals ideally seek to provide complementary sources of molecular information that eventually can be put together to obtain systems biology models of biological processes. Hence, it is no longer rare to find experimental designs involving the collection of genome, transcriptome, epigenome and even metabolome data on a particular system. However, standard methodologies for the integration of diverse omics data types are not yet ready, and researchers frequently face post-experiment questions on how to combine data of different nature, variability, and significance into an analysis routine that sheds more light than the analysis of individual datasets separately. In the STATegra project we have set out to address these questions using a mouse B-cell differentiation system as a model for integrative multi-omics analysis. We have generated up to 7 different omics measurements that range from chromatin features, through gene expression, to cellular metabolism. We will show how to investigate and assess differences in the nature of the data provided by the different omics technologies. We propose different integrative methods to study the dynamics of cell differentiation and the regulatory programs that control different stages of the process. Finally, I will demonstrate some of the software tools developed by STATegra that are available to the scientific community for analyzing multi-omics data.

Claire Lemaitre, IRISA Rennes

Reference-free detection of genomic variants: from SNPs to inversions

Abstract: Assessing the genetic differences between individuals within a species, or between chromosomes of an individual, is a fundamental task in many aspects of biology. Classical approaches for variant detection require a reference genome and rely on a first mapping step. With sequencing costs falling, sequencing efforts are no longer limited to the main species of interest, and biologists are increasingly working on data for which they do not have any close reference genome, or any good quality reference assembly. We present here an approach that is reference-free and that avoids the resource-intensive and difficult task of de novo assembly. Instead, polymorphisms of interest are searched for directly in the raw read datasets, as topological motifs in the de Bruijn graph. We present two software tools that implement this approach for several types of variants. DiscoSnp++ detects single nucleotide polymorphisms (SNPs) and small indels from any number of read sets, and TakeABreak focuses on structural variants such as inversions. Both tools run faster than assembly-based or mapping-based approaches, and have a tiny memory footprint, enabling their use on simple desktop machines.
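To make the "topological motif" idea concrete, here is a minimal Python sketch (illustrative only, not DiscoSnp++'s actual algorithm) that finds simple SNP bubbles — two parallel paths created by a single substitution that share their start and end k-mers — in a toy de Bruijn graph; it ignores reverse complements and coverage:

```python
from collections import defaultdict

def snp_bubbles(reads, k):
    """Detect simple SNP 'bubbles': two parallel paths that diverge from a
    shared k-mer and re-converge, created by a single substitution."""
    succ = defaultdict(set)
    for r in reads:
        for i in range(len(r) - k):
            succ[r[i:i + k]].add(r[i + 1:i + k + 1])

    def walk(start, steps):
        """Spell the sequence obtained by following unique successors."""
        seq, node = start, start
        for _ in range(steps):
            if len(succ[node]) != 1:
                return None
            node = next(iter(succ[node]))
            seq += node[-1]
        return seq

    bubbles = []
    for node, outs in sorted(succ.items()):
        if len(outs) == 2:                     # branching node: bubble start
            a, b = sorted(outs)
            pa, pb = walk(a, k), walk(b, k)
            # a single SNP yields two branches differing at exactly one position
            if pa and pb and sum(x != y for x, y in zip(pa, pb)) == 1:
                bubbles.append((node + pa[k - 1:], node + pb[k - 1:]))
    return bubbles

# Two alleles of the same locus differing by one substitution (C/T).
print(snp_bubbles(["AAACGTTT", "AAATGTTT"], k=3))
# → [('AAACGTT', 'AAATGTT')]
```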

Magnus Rattray, University of Manchester

Gaussian process modelling of omic time course data

Abstract: We are developing methods based on Gaussian process inference to analyse high-throughput biological time course data. Applications range from classical statistical problems such as clustering and differential expression through to systems biology models of cellular processes such as transcription and its regulation. Our focus is on developing tractable Bayesian methods which scale to genome-wide applications. I will describe our approach to a number of problems: (1) non-parametric clustering of replicated time course data; (2) inferring the full posterior of the perturbation time point from two-sample time course data; (3) inferring the pre-mRNA elongation rate from RNA polymerase ChIP-Seq time course data; (4) uncovering transcriptional delays by integrating pol-II ChIP-Seq and RNA-seq time course data through a simple differential equation model.
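As a minimal illustration of Gaussian process inference on time course data (a generic sketch with assumed hyperparameters, not the speaker's models), the following Python code computes the GP posterior mean of a noisy time course on a finer grid:

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale=0.25, variance=1.0):
    """Squared-exponential covariance between two sets of 1-D inputs."""
    d = x1[:, None] - x2[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def gp_posterior_mean(x_train, y_train, x_test, noise=0.05):
    """Posterior mean of zero-mean GP regression with an RBF kernel."""
    K = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    K_star = rbf_kernel(x_test, x_train)
    return K_star @ np.linalg.solve(K, y_train)

# Smooth a noisy 8-point "time course" onto a finer grid.
t = np.linspace(0, 1, 8)
y = np.sin(2 * np.pi * t) + 0.05 * np.random.default_rng(0).normal(size=8)
t_fine = np.linspace(0, 1, 50)
mu = gp_posterior_mean(t, y, t_fine)
```

The same machinery, with priors over kernel hyperparameters, underlies the clustering and perturbation-time problems listed above.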

Jean-Jacques Codani, Biofacet Paris

bfmir: a fast and accurate method for identifying micro-RNAs from NGS reads in plants

Abstract: Coming soon

Sandrine Dudoit, University of California Berkeley

Identification of Novel Cell Types Using Single-Cell Transcriptome Sequencing

Abstract: Single-cell transcriptome sequencing (scRNA-Seq), which combines high-throughput single-cell extraction and sequencing capabilities, enables the transcriptome of large numbers of individual cells to be assayed efficiently. Profiling of gene expression at the single-cell level for a large sample of cells is crucial for addressing many biologically relevant questions, such as the investigation of rare cell types or primary cells (e.g., early development, where each of a small number of cells may have a distinct function) and the examination of subpopulations of cells from a larger heterogeneous population (e.g., discovering cell types in brain tissues).

I will discuss some of the statistical analysis issues that have arisen in the context of a collaboration funded by the Brain Research through Advancing Innovative Neurotechnologies (BRAIN) Initiative, with the aim of classifying neuronal cells in the mouse somatosensory cortex. These issues, ranging from so-called low-level to high-level analyses, include: exploratory data analysis (EDA) for quality assessment/control (QA/QC) of scRNA-Seq reads, normalization to account for nuisance technical effects, cluster analysis to identify novel cell types, and differential expression analysis to derive gene expression signatures for the cell types. 

Mark Robinson, University of Zurich

Comparison of methods to detect changes in isoform usage with RNA-seq data

Abstract: RNA sequencing (RNA-seq) has vastly expanded the molecular biologist's toolbox to study and characterize gene expression in a wide array of experimental conditions and species. In a single assay, access to information on novel isoforms, abundance, isoform structure, RNA editing and allele-specificity is realized. However, there are many challenges in interpreting and processing this type of data. This talk will focus on the problem of discerning changes in isoform usage between experimental conditions using bioinformatics approaches. There are many ways to tackle the problem and therefore many methods and software packages already exist; however, it is not clear which approaches perform well. In this talk, I will give an overview of the methods available to unravel changes in isoform usage and highlight our recent work using simulations to understand their relative merits; we will make this benchmark available to the community so that new methods can use it for assessment.

Esko Ukkonen, University of Helsinki

Discovery of transcription factor binding motifs from large sequence sets

Abstract: The position weight matrix (PWM) model of transcription factor (TF) binding sites specifies a multinomial distribution of sequences that has only one dominating seed sequence. To make the model more accurate, one can use several seeds and also utilize the fact that the TFs not only bind to DNA but also to each other, forming dimeric regulatory complexes. By including dimerization it is possible to obtain models that have potentially stronger capability to explain the regulation of gene expression. The talk will describe recent developments in modeling and predicting regulatory complexes, using models that are mixtures of monomeric and dimeric PWMs, and are learned from large sequence sets produced by high-throughput SELEX experiments. (Joint work with J. Taipale, T. Kivioja, P. Rastas, A. Jolma and J. Toivonen.)
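For concreteness, scanning a sequence with a single monomeric PWM can be sketched as follows; the matrix values below are made up purely for illustration:

```python
import numpy as np

# Toy PWM (log-odds scores) for a 4-bp motif; rows = A, C, G, T,
# columns = motif positions. Values are invented for this example.
PWM = np.array([
    [ 1.2, -0.5, -1.0,  1.0],   # A
    [-0.8,  1.1, -0.9, -1.2],   # C
    [-0.7, -0.6,  1.3, -0.8],   # G
    [-1.0, -1.1, -0.7,  0.5],   # T
])
IDX = {"A": 0, "C": 1, "G": 2, "T": 3}

def pwm_score(window):
    """Sum of per-position log-odds scores for one window."""
    return sum(PWM[IDX[b], j] for j, b in enumerate(window))

def best_hit(seq, width=4):
    """Highest-scoring window and its start position in `seq`."""
    scores = [(pwm_score(seq[i:i + width]), i)
              for i in range(len(seq) - width + 1)]
    return max(scores)

score, pos = best_hit("TTACGAGTT")   # the consensus ACGA starts at position 2
```

Mixtures of monomeric and dimeric PWMs, as in the talk, generalize this single-seed scan.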

Paul Medvedev, Pennsylvania State University

Contig assembly: problem formulation, algorithms, and implementation

Abstract: The problem of genome assembly is to reconstruct the sequence of a genome given a set of reads. A common problem formulation is to find either an edge-covering walk in a de Bruijn graph or a node-covering walk in a string graph, where such walks represent the source genome. However, as there can be many such optimal walks in a graph due to repeats, assemblers do not actually attempt to output the source genome. Instead, they output "contigs," or regions that are guaranteed to be in the genome. In this talk, we formulate the contig assembly problem as an alternative to traditional formulations, and present algorithms that are, in a sense, optimal for this problem. We may also discuss algorithms to reduce the memory usage of de Bruijn graph construction as well as lower bounds on the memory needed.
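A minimal sketch of the classical starting point — extracting contigs as maximal non-branching paths (unitigs) of a de Bruijn graph — is given below; this illustrates the traditional formulation, not the talk's optimal algorithms, and ignores reverse complements and isolated cycles:

```python
from collections import defaultdict

def contigs(reads, k):
    """Build a de Bruijn graph on k-mers and output its maximal
    non-branching paths (unitigs) as contig strings."""
    succ, pred = defaultdict(list), defaultdict(list)
    nodes = set()
    for r in reads:
        for i in range(len(r) - k):
            a, b = r[i:i + k], r[i + 1:i + k + 1]
            if b not in succ[a]:               # store each edge once
                succ[a].append(b)
                pred[b].append(a)
            nodes.update((a, b))

    def one_in_one_out(n):
        return len(succ[n]) == 1 and len(pred[n]) == 1

    paths = []
    for n in nodes:
        if not one_in_one_out(n):              # start of a maximal simple path
            for m in succ[n]:
                path = [n, m]
                while one_in_one_out(path[-1]):
                    path.append(succ[path[-1]][0])
                paths.append(path)
    # glue overlapping k-mers: each node past the first adds one character
    return sorted(p[0] + "".join(q[-1] for q in p[1:]) for p in paths)

# Two reads sharing a prefix create a branch; contigs stop at it.
print(contigs(["ACGTA", "ACGTC"], k=3))   # → ['ACGT', 'CGTA', 'CGTC']
```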

Michael Brudno, University of Toronto

(Computationally) Solving Rare Disorders

Abstract: Gene mutations cause not only well-recognized rare diseases such as muscular dystrophy and cystic fibrosis, but also thousands of other rare disorders. While individually rare, these disorders are collectively common, affecting one to three percent of the population. The last several years have seen the identification of hundreds of novel genes responsible for rare disorders, and an even greater number of cases where a known gene was implicated in a new disease.

In this talk I will describe the computational approaches that are required to make this identification possible, and describe the tools that we (and others) have developed to enable clinicians to diagnose their patients by analyzing patients' genomes and sharing de-identified patient data.

Franck Picard, LBBE Lyon

Poisson Functional regression for the analysis of next generation sequencing data

Abstract: Next Generation Sequencing experiments have now become standard for the analysis of genome-wide molecular phenomena. A characteristic of the data produced is their volume and, when mapped on a reference genome, their spatial organization along chromosomes. Another particularity of these data is their discrete nature, as they consist of counts usually modeled by Poisson or Negative Binomial distributions. To account for the spatial structure, Hidden Markov models have been considered to detect read enrichments, but the flexibility of such approaches is limited, as is their performance in detecting sharp enrichments in the data. Here we develop a Poisson functional regression framework based on wavelets to analyze mapped NGS data. This framework is very flexible, as it provides a multiscale representation of the signal and allows for the introduction of replicates and covariates for normalization purposes. The statistical challenge lies in the selection of wavelet coefficients, which depends on penalty weights that need to be calibrated. Here we provide data-driven weights for the Lasso and the group-Lasso derived from concentration inequalities adapted to the Poisson case. We show that the associated Lasso and group-Lasso procedures are theoretically optimal in the oracle approach. Our method is then illustrated on the identification of replication origins in the human genome.
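A toy version of the wavelet idea (a generic Haar transform with a fixed soft threshold, not the calibrated Poisson-specific weights described in the abstract) can be sketched as:

```python
import numpy as np

def haar_forward(x):
    """Full Haar decomposition of a length-2^J signal: list of detail
    coefficient arrays (fine to coarse) plus the final approximation."""
    coeffs, a = [], np.asarray(x, dtype=float)
    while len(a) > 1:
        coeffs.append((a[0::2] - a[1::2]) / np.sqrt(2))   # details
        a = (a[0::2] + a[1::2]) / np.sqrt(2)              # approximation
    coeffs.append(a)
    return coeffs

def haar_inverse(coeffs):
    """Invert haar_forward exactly."""
    a = coeffs[-1]
    for dif in reversed(coeffs[:-1]):
        up = np.empty(2 * len(a))
        up[0::2] = (a + dif) / np.sqrt(2)
        up[1::2] = (a - dif) / np.sqrt(2)
        a = up
    return a

def denoise(counts, thresh):
    """Soft-threshold the detail coefficients with one fixed weight."""
    coeffs = haar_forward(counts)
    kept = [np.sign(d) * np.maximum(np.abs(d) - thresh, 0.0)
            for d in coeffs[:-1]]
    return haar_inverse(kept + [coeffs[-1]])

# Flatten small fluctuations while keeping the sharp enrichment at index 4.
counts = np.array([3, 2, 4, 3, 20, 22, 19, 21], dtype=float)
smooth = denoise(counts, thresh=2.0)
```

In the talk's framework the threshold is replaced by data-driven Lasso/group-Lasso weights derived from Poisson concentration inequalities.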

Karel Brinda, LIGM Marne-la-Vallée

RNF: a general framework to evaluate NGS read mappers

Abstract: Read simulators combined with alignment evaluation tools provide the most straightforward way to evaluate and compare mappers. Simulation of reads is accompanied by information about their positions in the source genome. This information is then used to evaluate the alignments produced by the mapper. Finally, reports containing statistics of successful read alignments are created. In the absence of standards for encoding read origins, every evaluation tool has to be made explicitly compatible with the simulator used to generate the reads.

To overcome this obstacle, we have created a generic format, RNF (Read Naming Format), for assigning read names with encoded information about original positions. Furthermore, we have developed an associated software package, RNFtools, containing two principal components. MIShmash applies one of several popular read simulators (DWGSIM, ART, MASON, CURESIM, etc.) and transforms the generated reads into RNF format. LAVEnder then evaluates a given read mapper using simulated reads in RNF format. Special attention is paid to mapping qualities, which serve to parametrize ROC curves, and to evaluating the effect of read sample contamination.
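The principle — encoding a simulated read's origin in its name, then checking a mapper's output against it — can be sketched as follows. The field layout below is purely illustrative; the real encoding is defined by the RNF specification:

```python
# Encode a simulated read's origin in its name, then score a mapper's output.
# The "sim__id__chrom__pos__strand" layout is a hypothetical stand-in for the
# actual RNF naming scheme.

def name_read(read_id, chrom, pos, strand):
    """Pack the simulated origin into the read name."""
    return f"sim__{read_id}__{chrom}__{pos}__{strand}"

def is_correct(read_name, mapped_chrom, mapped_pos, tolerance=5):
    """An alignment counts as correct if it lands within `tolerance` bp
    of the encoded origin on the right chromosome."""
    _, _, chrom, pos, _ = read_name.split("__")
    return mapped_chrom == chrom and abs(mapped_pos - int(pos)) <= tolerance

name = name_read(1, "chr1", 10_000, "+")
print(name)                                # → sim__1__chr1__10000__+
print(is_correct(name, "chr1", 10_003))    # → True
print(is_correct(name, "chr2", 10_000))    # → False
```

Because the origin travels inside the read name, any mapper's SAM output can be evaluated without knowing which simulator produced the reads.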

Gregory Kucherov, LIGM Marne-la-Vallée

Spaced seeds improve k-mer-based metagenomic classification

Abstract: Metagenomics is a powerful approach to study the genetic content of environmental samples that has been strongly promoted by NGS technologies. To cope with the massive data involved in modern metagenomic projects, recent tools (Kraken, LMAT) rely on the analysis of k-mers shared between the read to be classified and sampled reference genomes. Within this general framework, we show that spaced seeds provide a significant improvement in classification capacity compared to traditional contiguous k-mers. We support this thesis through a series of different computational experiments, including simulations of large-scale metagenomic projects.
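The intuition for why spaced seeds help can be shown in a few lines: a single mismatch destroys every contiguous k-mer that overlaps it, but only the spaced-seed keys whose sampled ("1") positions hit it. A minimal sketch with an arbitrary toy pattern:

```python
def seed_keys(read, pattern):
    """Extract seed keys from `read`: '1' positions of `pattern` are kept,
    '0' positions are wildcards and are dropped from the key."""
    keep = [i for i, c in enumerate(pattern) if c == "1"]
    span = len(pattern)
    return {"".join(read[i + j] for j in keep)
            for i in range(len(read) - span + 1)}

read = "ACGTACGA"
ref  = "ACGAACGA"            # one mismatch against `read` at position 3
contiguous = "1111"          # plain 4-mers
spaced     = "110011"        # a spaced seed of weight 4, span 6

shared_contig = seed_keys(read, contiguous) & seed_keys(ref, contiguous)
shared_spaced = seed_keys(read, spaced) & seed_keys(ref, spaced)
# Despite the mismatch, the spaced seed preserves more shared keys.
```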

Nicolas Servant, Institut Curie

HiC-Pro: An optimized and flexible pipeline for Hi-C data processing

Abstract: HiC-Pro is an optimized and flexible pipeline to process Hi-C data from raw reads to normalized contact maps. HiC-Pro maps reads, detects valid ligation products, generates and normalizes intra- and inter-chromosomal maps. It includes a fast implementation of the ICE normalization method and is based on a memory-efficient data format for Hi-C contact maps. We applied HiC-Pro in its parallel mode on the largest Hi-C dataset currently available (1.5 billion paired-end reads), demonstrating its ability to process very large data in a reasonable time. Source code, annotation files and documentation are available at http://github.com/nservant/HiC-Pro
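The ICE idea — iteratively rescaling the contact matrix until all bins have equal coverage — can be sketched as follows (a simplified illustration, not HiC-Pro's optimized implementation):

```python
import numpy as np

def ice(matrix, n_iter=50):
    """Iterative correction: rescale rows/columns of a symmetric contact
    matrix until coverage (row sums) is uniform; returns the balanced
    matrix and the accumulated per-bin biases."""
    m = matrix.astype(float).copy()
    bias = np.ones(m.shape[0])
    for _ in range(n_iter):
        cov = m.sum(axis=1)
        s = cov / cov[cov > 0].mean()
        s[s == 0] = 1.0              # leave empty bins untouched
        bias *= s
        m /= np.outer(s, s)
    return m, bias

# Toy symmetric contact map.
rng = np.random.default_rng(1)
raw = rng.poisson(20, size=(6, 6)).astype(float)
raw = (raw + raw.T) / 2
norm, bias = ice(raw)
```

After convergence every row of `norm` has (nearly) the same total, removing multiplicative per-bin biases such as GC content or mappability.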

Céline Lévy-Leduc, AgroParisTech/INRA

Two-dimensional segmentation approaches for analyzing HiC data

Abstract: The spatial conformation of the chromosome has a deep influence on gene regulation and expression. HiC technology allows the evaluation of the spatial proximity between any pair of loci along the genome. It results in a data matrix in which blocks corresponding to interacting regions appear. The delimitation of such blocks is critical to better understand the spatial organization of the chromatin. From a computational point of view, this results in a 2D-segmentation problem. We shall propose in this talk several approaches for dealing with this issue. The performance of the proposed methods will be assessed on synthetic and publicly available real data.

Nelle Varoquaux, Mines ParisTech

A statistical approach for inferring the 3D structure of the genome

Abstract: The spatial and temporal 3D organization of chromosomes is thought to have an important role in genomic function, but is still poorly understood. Recent advances in chromosome conformation capture (3C) technologies, initially developed to assess interactions between specific pairs of loci, make it possible to measure contacts simultaneously on a genome scale, paving the way to the reconstruction of full 3D models of genomes.

Inferring the 3D structure remains, however, a challenging problem. Many approaches convert interaction frequencies into physical distances and solve a constrained optimization problem (often non-convex) akin to multidimensional scaling (MDS). Recent works have proposed probabilistic models of interaction frequencies and their relationships with physical distances, and use MCMC sampling procedures to produce an ensemble of 3D structures.

We propose a new formulation of the inference as a maximum likelihood problem based on a statistical model of interaction frequencies in which the 3D structure is a latent variable. We demonstrate that our approach reconstructs better structures than previous MDS-based methods, particularly at low coverage and high resolution.
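For reference, the MDS-style baseline that such statistical methods are compared against can be sketched in a few lines: convert contact counts to distances via an assumed power law and apply classical MDS (this illustrates the prior approaches, not the proposed maximum likelihood method):

```python
import numpy as np

def contacts_to_structure(counts, alpha=-1.0 / 3, n_dim=3):
    """Classical MDS on distances derived from contacts via d ~ c**alpha."""
    counts = np.asarray(counts, dtype=float)
    d = np.zeros_like(counts)
    mask = counts > 0
    d[mask] = counts[mask] ** alpha
    d[~mask] = d.max() * 1.5             # crude imputation for unobserved pairs
    np.fill_diagonal(d, 0.0)
    n = d.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n  # double-centering operator
    B = -0.5 * J @ (d ** 2) @ J          # Gram matrix of the embedding
    w, v = np.linalg.eigh(B)
    top = np.argsort(w)[::-1][:n_dim]
    return v[:, top] * np.sqrt(np.maximum(w[top], 0.0))

# Toy example: beads on a line, contact counts decaying as distance**-3.
pos = np.arange(8, dtype=float)
true_d = np.abs(pos[:, None] - pos[None, :])
counts = np.where(true_d > 0, true_d, 1.0) ** -3.0
np.fill_diagonal(counts, 0.0)
X = contacts_to_structure(counts)        # recovers the line up to rotation
```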

François Rechenmann, GenoStar

Integrated software for variant analysis from NGS data

Abstract: Genostar develops a software application dedicated to the fine-grained analysis and comparison of bacterial genomes. This application associates i) a database for collecting and storing genotypic and phenotypic data, ii) a set of methods for analyzing, screening and comparing NGS data, and iii) several interactive data viewers.

In the context of a partnership with the Fondation Mérieux in Lyon, this application is being tested for the follow-up of tuberculosis patients. After sequencing the clinical strains, the reads are stored in the database and connected to the clinical data of the associated patients. A workflow of analysis methods (read assembly, mapping on a reference M. tuberculosis sequence, SNP screening, and so on) is executed on the raw data in order to identify and characterize SNPs. The application makes it possible to follow, for a given patient, the emergence of antibiotic resistance markers over time. Moreover, by crossing the phenotypic data with the genotypic data, new markers can also be discovered and stored in the database once validated.

The core of the application is a set of NGS data processing methods. Several of them have been specifically designed and developed in the course of this project.

The application has been designed so that it can be adapted to other pathogens. For example, first experiments on M. leprae data will be carried out in collaboration with EPFL (Lausanne) in fall 2015. Other uses in biotechnology processes are also expected, typically to evaluate the stability of bioproduction strains.

Valentina Boeva, Institut Curie

SV-Bay: structural variant detection in cancer genomes using a Bayesian approach with correction for GC-content and read mappability

Abstract: Whole-genome sequencing of paired-end reads can be applied to characterize the landscape of large somatic rearrangements in cancer genomes. Several methods for detecting structural variants with whole-genome sequencing data have been developed. So far, none of these methods has combined information about abnormally mapped read pairs connecting rearranged regions with the associated copy number changes. Our aim was to create a computational method that could use both types of information, i.e., normal and abnormal reads, and to demonstrate that by doing so we can greatly improve both the sensitivity and specificity of structural variant prediction.

We developed a computational method, SV-Bay, to detect structural variants from whole genome sequencing mate-pair or paired-end data using a probabilistic Bayesian approach. This approach takes into account depth of coverage by normal reads and abnormalities in read pair mappings. To estimate the model likelihood, SV-Bay considers GC-content and read mappability of the genome, thus making important corrections to the expected read count. For the detection of somatic variants, SV-Bay makes use of a matched normal sample when it is available. We validated SV-Bay on simulated datasets and an experimental mate-pair dataset for the CLB-GA neuroblastoma cell line. The comparison of SV-Bay with several other methods for structural variant detection demonstrated that SV-Bay has better prediction accuracy both in terms of sensitivity and false positive detection rate.

The method is available at https://github.com/InstitutCurie/SV-Bay

Alice Cleynen & Stéphane Robin, AgroParisTech/INRA MIA

Comparing change-point location in RNAseq data

Abstract: Next generation sequencing technologies provide precise estimates of the boundaries of transcribed regions. This enables us both to revise genome annotation and to study the variation of these boundaries when conditions vary. The latter problem reduces to the comparison of change-point locations in the segmentation of multiple series. We consider a Bayesian framework with conjugate priors to perform exact inference on the change-point model. When comparing two series, we derive the posterior credibility interval of the shift between the locations. When comparing more than two series, we compute the posterior probability for a given change-point to have the same location in all series. All calculations are made in an exact manner in quadratic time. This work is motivated by the comparison of transcript boundaries in yeast grown under different conditions. In this case, our approach reveals different behaviors of internal and external exon boundaries.
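The flavour of exact conjugate inference on a change-point model can be illustrated on a single Poisson series with Gamma priors (a toy version with one change-point and a uniform prior on its location; the talk addresses the harder multi-series comparison):

```python
import math

def log_marginal(counts, a=1.0, b=1.0):
    """Log marginal likelihood of i.i.d. Poisson counts whose rate has a
    Gamma(a, b) prior (Gamma-Poisson conjugacy, closed form)."""
    s, n = sum(counts), len(counts)
    return (a * math.log(b) - math.lgamma(a)
            + math.lgamma(a + s) - (a + s) * math.log(b + n)
            - sum(math.lgamma(c + 1) for c in counts))

def changepoint_posterior(counts):
    """Exact posterior over the change-point location tau = 1..n-1,
    with a uniform prior on tau."""
    logs = [log_marginal(counts[:t]) + log_marginal(counts[t:])
            for t in range(1, len(counts))]
    m = max(logs)
    w = [math.exp(l - m) for l in logs]     # stabilized exponentiation
    z = sum(w)
    return [x / z for x in w]

# Rates jump from ~2 to ~10 after position 5 (true tau = 5).
y = [2, 1, 3, 2, 2, 11, 9, 10, 12, 8]
post = changepoint_posterior(y)             # peaks at tau = 5 (index 4)
```

Comparing change-point locations across series then amounts to combining such exact per-series posteriors.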

Elsa Bernard, Mines ParisTech

A convex formulation for joint RNA isoform detection and quantification from multiple RNA-seq samples

Abstract: Detecting and quantifying isoforms from RNA-seq data is an important but challenging task. The problem is often ill-posed, particularly at low coverage. One promising direction is to exploit several samples simultaneously. We propose a new method for solving the isoform deconvolution problem jointly across several samples. We formulate a convex optimization problem that allows information to be shared between samples and that we solve efficiently. We demonstrate the benefits of combining several samples on simulated and real data, and show that our approach outperforms pooling strategies and methods based on integer programming. Source code is available at http://cbio.ensmp.fr/flipflop.
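A toy version of convex isoform deconvolution for a single sample (illustrative only; the actual method jointly penalizes several samples over a splicing graph) can be written as a non-negative lasso solved by projected gradient:

```python
import numpy as np

def nn_lasso(X, y, lam, n_iter=500):
    """Non-negative lasso by projected gradient:
    minimize 0.5 * ||y - X w||^2 + lam * sum(w), subject to w >= 0."""
    lr = 1.0 / np.linalg.norm(X, 2) ** 2     # step size from the Lipschitz constant
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) + lam
        w = np.maximum(w - lr * grad, 0.0)   # project onto the nonnegative orthant
    return w

# Hypothetical candidate isoforms as exon-indicator vectors (exons x isoforms);
# the observed exon coverage `y` mixes isoforms 0 and 2 only.
X = np.array([[1, 1, 0, 0],
              [1, 0, 1, 0],
              [1, 1, 1, 1]], dtype=float).T
true_w = np.array([2.0, 0.0, 1.0])
y = X @ true_w
w = nn_lasso(X, y, lam=0.01)                 # sparse: isoform 1 is dropped
```

The sparsity-inducing penalty selects few isoforms; the joint multi-sample formulation in the talk couples such problems across samples.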

Aurélie Teissandier, Institut Curie

Mapping transposons in heterogeneous NGS data

Abstract: Transposable elements represent around 40% of mammalian genomes. Over the course of evolution, these sequences have provided beneficial innovations and have participated in genome shaping and speciation. However, in the short term, these elements contribute to genome instability by altering gene organization and expression. Transposons represent a significant challenge for next generation sequencing analyses. Notably, the majority of sequence reads derived from these elements map to multiple positions in the genome, preventing unambiguous conclusions about their origin. In this study, we address the bioinformatic challenges of transposon analysis and discuss several strategies to investigate their expression and regulation in mouse embryonic stem cells using RNA-seq, ChIP-seq and WGBS data.