I am currently a Principal Genomics Data Scientist at Genomics England. I am building statistical analysis frameworks and conducting association studies on genetic and phenotypic data from the largest high-coverage whole-genome dataset in the world (>100,000 people in early 2020).
In Lausanne, as a member of the Department of Computational Biology and the Swiss Institute of Bioinformatics, I developed new Bayesian methodology to improve the prediction of complex traits and diseases. I used this methodology to analyse vast databases containing genetic and phenotypic information for hundreds of thousands of individuals (e.g., UK Biobank).
I conducted the research below at the Human Evolutionary Genetics (HEG) unit at Institut Pasteur, Paris, as a Roux-Cantarini independent fellow.
In this project, together with a PhD student that I co-supervised, I analysed 300 exomes of African Pygmies and African Bantu-speaking farmers. I inferred their demographic history using a coalescent composite likelihood approach (fastsimcoal2) and found that they experienced a recent bottleneck and a recent expansion, respectively. However, we found no differences between the populations in their distribution of fitness effects of new mutations (DFE) or in their counts of deleterious alleles and homozygotes. Our results are in line with theoretical expectations and can be explained by the historically large effective population sizes of hunter-gatherers and their strong, more recent admixture with farmers. These results are about to be submitted as a paper.
The HEG lab has generated large-scale expression data from blood cells of 200 individuals of European and African descent, exposed to different pathogenic stimuli (flu, bacterial and viral proteins). These individuals have also been genotyped and exome-sequenced. Analysis of these data led to the discovery of thousands of expression quantitative trait loci (eQTLs). I am currently investigating the extent to which the effect of eQTLs on gene expression, and the amount of expression variation that they explain, is related to the selective constraint on the coding and non-coding regions of genes. I want to quantify the relationship between the fitness effects of variants and their effect sizes on absolute expression levels and on the expression response to stimuli.
For this project, I collaborated with Carla Saleh and Vanesa Mongelli, members of the Viruses and RNAi Unit at the Department of Virology, Institut Pasteur in Paris. I analysed sequencing data from an evolution experiment with an RNA virus, the Drosophila C virus. This work was carried out in direct collaboration with an experimental molecular biology group and was truly interdisciplinary.
During my PhD studies, I realized that to realistically tackle the complexity of evolutionary interactions in the genome we need sufficiently complex models. However, inference for complex models is often difficult because the likelihood function is intractable. A promising emerging methodology is approximate inference through simulations (Approximate Bayesian Computation; ABC). This is why I decided to join the group of Daniel Wegmann, a leading expert in ABC, in Fribourg, Switzerland. With Daniel Wegmann we tackled several research questions in diverse fields:
My first application of ABC was part of an international collaboration with Nadia Singh at the University of North Carolina, USA, to reconstruct the invasion history of a devastating crop pest, Drosophila suzukii. Our research showed that D. suzukii has most likely invaded each of the other continents independently from Asia. We also found through simulations that large, genome-wide datasets are needed to discriminate between different invasion histories.
Several variants of ABC have been developed, including the coupling of ABC with MCMC (ABC-MCMC), which has been shown to be faster than standard ABC and to yield more accurate posterior estimates. However, existing ABC methods do not handle high-dimensional models with many parameters (typically more than ten) well. Together with D. Wegmann and a collaborating mathematician, Christoph Leuenberger, I developed a new ABC-MCMC method that is optimized for the analysis of high-dimensional models. We also applied our new ABC-MCMC method to jointly infer the distribution of fitness effects of new mutations (DFE), the selection coefficient of each segregating allele and the demography from time-series allele frequency data. Using this method, we analysed time-series data obtained from experimental evolution of the influenza virus under antiviral drug treatment. Our analysis provided insights into the shape of the DFE and revealed several positively selected mutations close to the interaction center between key viral proteins and the drug.
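For readers unfamiliar with ABC-MCMC, the basic idea can be sketched in a few lines of Python. This is only a minimal illustration of the textbook scheme on a toy problem (inferring the mean of a normal distribution from its sample mean), and not the high-dimensional method described above. All names and settings (simulate_summary, abc_mcmc, the tolerance eps, the uniform prior bounds) are assumptions made for the example.

```python
import random
import statistics


def simulate_summary(theta, n, rng):
    """Simulate n draws from Normal(theta, 1) and return the sample mean."""
    return statistics.fmean(rng.gauss(theta, 1.0) for _ in range(n))


def abc_mcmc(s_obs, n, n_steps, eps, step_sd, prior_lo, prior_hi, theta0, seed=0):
    """Toy ABC-MCMC with a uniform prior and a symmetric Gaussian proposal.

    With a flat prior and a symmetric proposal the Metropolis-Hastings
    ratio equals 1 inside the prior bounds, so a proposal is accepted
    whenever its simulated summary lies within eps of the observed one.
    """
    rng = random.Random(seed)
    theta = theta0
    chain = []
    for _ in range(n_steps):
        proposal = rng.gauss(theta, step_sd)
        # Proposals outside the prior support have zero prior density.
        if prior_lo <= proposal <= prior_hi:
            s_sim = simulate_summary(proposal, n, rng)
            # The likelihood is never evaluated: closeness of simulated and
            # observed summaries stands in for it.
            if abs(s_sim - s_obs) <= eps:
                theta = proposal
        chain.append(theta)  # on rejection the current value is repeated
    return chain


# Hypothetical observed summary statistic (sample mean of 50 observations).
s_obs = 2.0
chain = abc_mcmc(s_obs=s_obs, n=50, n_steps=5000, eps=0.2, step_sd=0.3,
                 prior_lo=-10.0, prior_hi=10.0, theta0=s_obs)
posterior = chain[1000:]  # discard burn-in
post_mean = statistics.fmean(posterior)
print(f"approximate posterior mean: {post_mean:.2f}")
```

In this sketch the chain is started at the observed summary for convenience. With many parameters and summary statistics, simulations rarely fall within the tolerance, which is the kind of difficulty that motivates methods tailored to high-dimensional models.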
In the Wegmann group, we developed new methods that call genotypes and estimate heterozygosity from low-coverage data of single individuals while accounting for post-mortem damage and sequencing error. As part of a collaboration with a large consortium, and together with a PhD student that I co-supervised, we used this set of methods to analyse whole-genome data generated from individuals who lived in the Neolithic period in Greece, Anatolia and Iran. These analyses revealed that agriculture spread to Europe from Anatolia through Greece genetically, that is, through the movement of people rather than through cultural transmission, and that agriculture in East Asia and Europe did not originate from the same population.
During my PhD studies at the Institute of Evolutionary Biology, University of Edinburgh, in the UK, I took part in the analysis of a whole-genome polymorphism dataset from wild house mice (Mus musculus castaneus) to infer natural selection and the demographic history of this species. More specifically, I contrasted the rate of adaptive evolution between protein-coding and non-coding DNA and between autosomal and X-linked genes. I also developed and compared new models to infer the distribution of fitness effects of new mutations (DFE) and applied these models to genome-wide datasets from wild house mice and Drosophila melanogaster.