Genome Data Science

(AGENDAS)


at the Institute for Research in Biomedicine (IRB Barcelona).


Are you good with numbers? Interested in genomics? Want to help us crack the code of cancer?

We're looking for motivated postdocs and/or students. Feel free to get in touch!

A deluge of genomic, transcriptomic and phenomic data presents vast opportunities to learn about the properties of living systems, but it also presents challenges.

In order to answer outstanding questions in biology and medicine, researchers need to discover meaningful and robust patterns in data. In doing so, they face the (lack of) structure, the complexity and the massive size of omics data sets.

Data scientists must therefore draw on extensive computational know-how and harness a variety of statistical and machine learning methods to get from data to biological insight.

funded by ERC StG "HYPER-INSIGHT"

"An approximate answer to the right problem is worth a good deal more than an exact answer to an approximate problem." -- John Tukey

In the aGENDAS group, we strive to elucidate the links between mutational processes, natural selection, gene function and phenotype by means of genome analyses. In particular, we use cutting-edge computational techniques and statistical/machine learning methodologies for analyses of massive genomic data sets.

We aim to answer important biological questions by insightful analysis of data originating from human cancers (somatic mutations, chromosomal alterations, transcriptomes), human populations (germline variants), metagenomics (including human microbiomes) and also fully sequenced microbial genomes.

Group members:

Fran Supek

group leader

@megaNarF



Josep Biayna

postdoc

@JBiaynaRodrguez

David Mas-Ponte

PhD student

@davidmasp

Jurica Levatić

postdoc

Marina Salvadores

MSc student

Albert Lahat

postdoc

~~~~

¿ you ?

motivated postdoc/PhD candidates, feel free to get in touch

~~~~

The research interests of the aGENDAS group are organized into four themes:

[1] MUTATION

Unraveling mutational processes. Mutations are the fuel of carcinogenesis and it is imperative to learn what causes them and how they drive evolution in general, and cancer evolution in particular. We have shown that somatic mutations are unevenly distributed across the human genome due to differential activity of DNA mismatch repair (MMR), which preferentially protects gene-rich regions (Supek & Lehner 2015 Nature). Moreover, motivated by the discoveries of APOBEC3 mutagenesis in tumors, we found another prevalent process that creates clustered mutations in many cancer types -- error-prone MMR, evident as the mutational signature of DNA polymerase eta (POLH). In addition, the histone mark H3K36me3 is an important determinant of both the standard, error-free MMR and the non-canonical, error-prone MMR (Supek & Lehner 2017 Cell).

[2] SELECTION

Genomic signatures of natural selection. Most somatic mutations found in cancer cells are ‘passengers’, with little phenotypic consequence. Detecting the few ‘driver’ mutations among them is challenging, yet crucial to understanding carcinogenic transformation. We have previously discovered that synonymous mutations, i.e. those that occur in gene coding regions but do not change the amino acid sequence, commonly drive cancer by affecting splicing patterns of oncogenes (Supek et al. 2014 Cell). Moreover, we have learnt how the quality control pathway of nonsense-mediated mRNA decay (NMD) decides which mRNAs to degrade (Lindeboom et al. 2016 Nat Genet), and used these rules of NMD to reveal patterns of positive and negative selection on tumor suppressor genes.

Abstract of Supek & Lehner 2015 Nature: Cancer genome sequencing has revealed considerable variation in somatic mutation rates across the human genome, with mutation rates elevated in heterochromatic late replicating regions and reduced in early replicating euchromatin. Multiple mechanisms have been suggested to underlie this, but the actual cause is unknown. Here we identify variable DNA mismatch repair (MMR) as the basis of this variation. Analysing ~17 million single-nucleotide variants from the genomes of 652 tumours, we show that regional autosomal mutation rates at megabase resolution are largely stable across cancer types, with differences related to changes in replication timing and gene expression. However, mutations arising after the inactivation of MMR are no longer enriched in late replicating heterochromatin relative to early replicating euchromatin. Thus, differential DNA repair and not differential mutation supply is the primary cause of the large-scale regional mutation rate variation across the human genome.

Abstract of Lindeboom et al. 2016 Nat Genet: Premature termination codons (PTCs) cause a large proportion of inherited human genetic diseases. PTC-containing transcripts can be degraded by an mRNA surveillance pathway termed nonsense-mediated mRNA decay (NMD). However, the efficiency of NMD varies; it is inefficient when a PTC is located downstream of the last exon junction complex (EJC). We used matched exome and transcriptome data from 9,769 human tumors to systematically elucidate the rules of NMD targeting in human cells. An integrated model incorporating multiple rules beyond the canonical EJC model explains approximately three-fourths of the non-random variance in NMD efficiency across thousands of PTCs. We also show that dosage compensation may sometimes mask the effects of NMD. Applying the NMD model identifies signatures of both positive and negative selection on NMD-triggering mutations in human tumors and provides a classification for tumor-suppressor genes.
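
As a rough illustration of the canonical EJC rule mentioned above (NMD is inefficient when the PTC lies downstream of the last exon junction), here is a minimal Python sketch. It is not the integrated model from the paper, which incorporates further rules that are not reproduced here. The function name, the 0-based transcript coordinates and the treatment of single-exon transcripts are assumptions made for this example only.

```python
from typing import Sequence


def escapes_nmd_by_last_junction(ptc_pos: int, junctions: Sequence[int]) -> bool:
    """Apply only the simplified 'last exon junction' rule.

    ptc_pos   -- position of the premature termination codon in the mRNA
                 (0-based transcript coordinate; an assumption of this sketch)
    junctions -- position of the last nucleotide of each exon that is
                 followed by another exon, in the same coordinates

    Returns True if the PTC is predicted to escape NMD under this rule.
    """
    if not junctions:
        # Assumption: a single-exon transcript has no exon junction,
        # so this simplified rule predicts no NMD.
        return True
    # A PTC downstream of the last junction is predicted to escape NMD.
    return ptc_pos > max(junctions)


# Hypothetical transcript with three exon-exon junctions
junctions = [120, 340, 610]
print(escapes_nmd_by_last_junction(700, junctions))  # PTC in the last exon
print(escapes_nmd_by_last_junction(200, junctions))  # PTC upstream of the last junction
```

With these made-up coordinates, the first call returns True (the PTC sits in the last exon) and the second returns False.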

[3] FUNCTION

Automated inference of gene function. Genome sequencing technologies are rapidly advancing, providing an abundance of genomes of prokaryotic and eukaryotic species, and also of populations thereof. This presents an opportunity to learn about the function of the ~1/3 of the genes for which, remarkably, a biological role is still not known. We have devised a methodology to infer gene function from the evolution of codon biases, and experimentally validated tens of predictions in E. coli (Krisko et al. 2014 Genome Biol). We have also investigated how best to combine heterogeneous genomic predictors, finding that it often pays off to simply trust the single most confident call, even if it is not supported by multiple methods (Vidulin et al. 2016 Bioinformatics).
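
The idea of trusting the single most confident call can be sketched in a few lines of Python. This is only an illustration of the combination rule described above, not the published method: the predictor names, function labels and scores are hypothetical, and the sketch assumes that confidence scores are comparable across predictors.

```python
from typing import Dict, Tuple


def most_confident_call(
    predictions: Dict[str, Tuple[str, float]]
) -> Tuple[str, str, float]:
    """Return (predictor, function_label, confidence) of the top-scoring call.

    predictions maps a predictor name to its (function_label, confidence)
    for one gene. Agreement between predictors is deliberately ignored.
    """
    if not predictions:
        raise ValueError("no predictions to combine")
    # Pick the predictor whose call has the highest confidence.
    name = max(predictions, key=lambda k: predictions[k][1])
    label, confidence = predictions[name]
    return name, label, confidence


# Hypothetical calls for one gene from three illustrative predictors
calls = {
    "codon_bias": ("oxidative stress response", 0.91),
    "phyletic_profile": ("DNA repair", 0.55),
    "gene_neighborhood": ("DNA repair", 0.60),
}
print(most_confident_call(calls))
```

In this made-up example the codon-bias call is selected even though the other two predictors agree with each other, which is the behaviour the rule is meant to capture.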

[4] PHENOTYPE

Genetic basis of phenotypes. Various kinds of -omics data accumulate rapidly and are increasingly organized into tidy, structured repositories. In contrast, phenomics data, while very valuable, are less often collected in a systematic manner and encoded in computable formats. This hampers the discovery of genes that underlie various phenotypes. We have used machine learning to text-mine the scientific literature and annotate microbial species with >400 phenotypic traits (Brbić et al. 2016 Nucl Acids Res), and to suggest their genetic basis (including prevalent epistasis in gene repertoires). One example is the genomes of pathogenic bacteria, which tend to encode proteomes resistant to unfolding, thereby protecting the microbes from oxidative stress (Vidović et al. 2014 Cell Rep).

Fran Supek - highlighted publications:

  • Clustered Mutation Signatures Reveal that Error-Prone DNA Repair Targets Mutations to Active Genes. F Supek, B Lehner (2017) Cell. Mutation clusters in cancer genomes provide fingerprints of mutagenic mechanisms // Error-free mismatch repair lowers the mutation rate in H3K36me3-marked active genes // Error-prone repair using POLH also targets H3K36me3, contributing driver mutations // UV and alcohol increase error-prone repair, targeting mutations toward active genes.
  • The rules and impact of nonsense-mediated mRNA decay in human cancers. RGH Lindeboom, F Supek*, B Lehner* (2016) Nature Genetics. (*corresponding authors) Matched exome and transcriptome data can systematically elucidate the rules of NMD targeting in human tumors, explaining ¾ of the variance in NMD efficiency. Applying our NMD model identifies signatures of positive and negative selection on nonsense mutations in human tumors and provides a classification for tumor-suppressor genes.
  • Differential DNA mismatch repair underlies mutation rate variation across the human genome. F Supek, B Lehner (2015) Nature. Somatic mutation rates exhibit tissue-specificity coupled to regional changes in DNA replication timing and gene expression. A temporal deconvolution of mutational signatures in microsatellite-instable tumors of the colon, stomach and uterus demonstrates that post-replicative MMR is the cause of the megabase-scale mutation rate variability in the human genome.
  • Synonymous mutations frequently act as driver mutations in human cancers. F Supek, B Miñana, J Valcárcel, T Gabaldón, B Lehner (2014) Cell. Enrichments of somatic mutations indicate that ~1 in 5 synonymous mutations in oncogenes are cancer drivers. Involvement in known exonic splicing motifs and association with RNA-Seq data implicate many causal synonymous mutations in altered splicing. The 3’ UTRs of dosage-sensitive oncogenes also harbour causal mutations. ~ Covered in a ‘preview’ article in Cell. ~
  • Inferring gene function from evolutionary change in signatures of translation efficiency. A Krisko, T Copic, T Gabaldón, B Lehner, F Supek (2014) Genome Biology. The changes in codon adaptation in orthologous gene families can systematically predict function of many genes by employing machine learning to rule out confounding variables. We have experimentally validated novel roles in adaptation to environmental stressors (oxygen, heat, salinity) for tens of E. coli genes.
  • The landscape of microbial phenotypic traits and associated genes. M Brbić, M Piškorec, V Vidulin, A Kriško, T Šmuc, F Supek. (2016) Nucl Acids Res. We have systematically annotated >3,000 prokaryotic taxa with >400 phenotypes, while drawing on comparative genomics and text mining techniques. This reveals thousands of gene families causally involved in various microbial traits, as well as pervasive epistasis that has shaped gene repertoires of these organisms.

"All models are wrong, but some models are useful." -- George E.P. Box (1979)