Postdoc positions

Boosting Alignment-free Taxonomic Classification of environmental DNA

Contacts:  Fabio PardiEric Rivals   

Summary.  The Methods and Algorithms for Bioinformatics (MAB) team at the computer science department of the University of Montpellier is looking for talented individuals to fill an 18-months postdoctoral position. The successful candidate will work on a novel and promising approach for the identification of the species behind the DNA found in an environmental sample. The originality of the approach lies in the application, within evolutionary analyses, of ideas from string algorithmics that avoid the computationally-intensive task of sequence alignment. A detailed description of the research project can be found below. 

Profile of the candidate.  The applicant should have a background in bioinformatics with an emphasis on algorithmic and software developments for biological sequence analysis and computational genomics. Previous research experience in one of the following topics in bioinformatics will be considered as an advantage: genome assembly, metagenomics, alignment-free sequence analysis, algorithms and data structures for biological strings, statistical molecular evolution, computational phylogenetics. Regardless of her/his background, the applicant’s past research should show a strong motivation towards solving biologically-relevant problems in a rigorous way, using state-of-the-art techniques and/or developing novel methodologies. Good programming skills, as well as an ability for autonomous work, are essential.

Detailed research project                                


High-throughput and inexpensive DNA sequencing has become the norm to analyze a large spectrum of biological samples. Well-known applications range from health care, where diagnostics based on human samples (skin, saliva, feces...) are being implemented in many hospitals, to agro-environmental applications, where the diversity of the microorganisms in the soil, water, plants... has important implications to monitor an ecosystem [1]. Furthermore, recent innovations in DNA sequencing miniaturization ( are pushing DNA-based diagnostics to every layer of society; the analysis of dust from household cleaning [2], or the use of mobile sequencers in academic classrooms [3] are already discussed as future examples of citizen science. Metagenomics is the study of this large variety of environmental DNA samples. Contrary to genomic approaches studying a single genome, metagenomes are generally built from bulk DNA extractions and represent complex biological communities including known and yet unknown species (protists, bacteria, viruses…). The bioinformatic analysis of metagenomes is challenging because most of Earth’s microorganisms (and by extension their genomes) are still poorly known. As a result, only a small fraction of metagenomic sequences (MSs) can be readily assigned to known species, while most of them show poor similarity to reference sequences registered in current databases. De facto, identifying the species at the origin of MSs remains a bottleneck for diagnoses based on human or environmental microbiomes. 

Several classes of bioinformatic/algorithmic approaches are possible for species identification. The choice will generally depend on the biological question and on which speed/precision trade-off is acceptable to answer this question. For instance, the exploration of relatively unknown microbiomes can be done via compositional analysis (frequency of k-mers, i.e words of length k), unsupervised clustering, or rapid pairwise alignments to reference databases [4,5].  The precision of these approaches is limited, with many MSs being assigned to high taxonomic levels (e.g., a sequence is identified as belonging to the Fungi, but the exact species cannot be defined). Nevertheless, their algorithms are often very fast, allowing the analysis of millions of MSs in timescales compatible with large-scale monitoring and everyday diagnostics. If precise classification is desired  for example to identify human or plant pathogens, where adapted treatments can only be administered only after robust identification [6,7] – only advanced phylogenetic modelling is sufficiently resolutive.  However, the standard tools for phylogenetic inference are not applicable to metagenomic datasets, not least because of the sheer number of sequences to analyze (in the millions, while the most efficient methods/infrastructures currently allow inference on tens of thousands of sequences at best).

A solution to this bottleneck is phylogenetic placement [8,9]. The central idea of this technique is to ignore the evolutionary relationships between the MSs and to focus only on elucidating the relationships between the MSs and the known reference genomes. In more detail, a number of reference genomes (or genomic regions) of well-known species are aligned together and a “reference” phylogenetic tree is inferred from the resulting multiple alignment. This reference is the backbone onto which the MSs matching the reference genomes will be “placed”, i.e. assigned to specific edges. (See the figure below.) When a “query MS” (an individual sequence among the MSs) is placed on an edge, this is taken to represent the evolutionary point where it diverged from the phylogeny. Note that when the placement edge of a query MS is not close to a tip of the tree, the interpretation is that the MS does not belong to any reference genome but to a unknown/unsampled genome. This is different from other approaches [10], where a MS assigned to an internal edge usually means that it cannot be confidently assigned to any species of the corresponding subtree. Currently, only two phylogenetic placement methods are widely used, Pplacer [8] and EPA (a component of RAxML) [9]. They are both likelihood-based, i.e. they seek to place each MS in the position that maximises the likelihood of the tree that results from the addition of the MS at that position. 

Previous work at the LIRMM

The VIROGENESIS project was motivated by the need of fast and scalable bioinformatic tools for viral metagenomics that may be used, for example, in everyday medical diagnosis. As members of this project, some of us (B. Linard, F. Pardi) have been working on a new approach for the phylogenetic placement of MSs, which is currently tailored to viral metagenomics; clinical and epidemiological applications have been our primary concern so far.  To accelerate the placement process, our method relies on the construction of a large database of "ancestral" k-mers (noted “AkDB”), i.e. of all the sequences of length k that were present with non-negligible probability in the ancestors of the genomes that are used as a reference for the placement. In addition to these k-mers, the AkDB also contains the k-mers that are possibly present in unknown/unsampled relatives of the reference genomes. AkDB construction, represented on the left in the figure below, uses standard techniques in computational phylogenetics to compute the posterior probabilities of ancestral states (A, C, G, T) along the edges of the phylogeny, given the states observed at the tips of the tree. For each of the k-mers stored in the DB, we also store the edges for which that k-mer was reconstructed, and the posterior probabilities for that k-mer, at each of those edges. AkDB construction is the heaviest computational part in our approach, but it is a preprocessing step that only needs to be executed once, before any metagenome is analysed. The AkDB can then be used multiple times to process new data, each time a new metagenome is sampled (from a new patient, at different timepoints, at different geographical locations…). This is represented in the right part of the figure below.

As for this second phase – in which the user exploits the AkDB for the phylogenetic placement of the query MS – the key idea is that k-mers in each query MS and k-mers in the AkDB are matched to find which edge of the reference tree shows the strongest evidence to be the correct placement. More in detail, for each query MS, a calculation akin to a weighted vote is employed:  each k-mer casts multiple votes – one for each of the edges for which it was reconstructed – and each vote is weighted proportionally to the k-mer probability at the voted edge. The edge receiving the largest total weighted vote is the one on which the query MS is placed. 

Our approach has a number of key computational advantages over competing placement methods: (i) previous likelihood-based methods require that each query MS is aligned to the reference genomes prior to the placement itself, a complex step that relies on techniques such as profile Hidden Markov Models, and that may dominate computation time. In contrast, our approach is alignment-free and does not need to execute this step. (ii) From an algorithmic complexity perspective, runtimes only depend on the size of the query sequence and on the number of edges that are associated to the k-mers appearing in the query – an advantage over likelihood methods, which have runtimes that depend on both the length of the reference sequences and on the size of the reference tree. Our experiments show that, as a result of these advantages, our method is orders of magnitude faster than competing tools. 


An original aspect of our method is that it combines on the one hand (i) an approach common in biological sequence algorithmics — the description of a longer sequence through its constituent k-mers — to (ii) methodologies that are commonly used in statistical and computational phylogenetics — posterior probabilities of ancestral genetic sequences.  Since the principal investigator’s (F. Pardi) background is in phylogenetics, up to now the project has only employed advanced ideas from this field, whereas the ideas that we used for biological sequence algorithmics and storage remain relatively basic. Because of this, we believe that the full potential of our approach remains to be fulfilled. We intend to hire a postdoctoral researcher whose background, still computational, should be mostly complementary to that of the authors meaning that it should centered on algorithms and data structures for biological sequence analysis and genomics. The expertise of E. Rivals lies precisely in those fields. Below we propose a number of lines of work that will help in the development of a robust method for precise phylogenetic classification of metagenomic sequences, which may be employed in large-scale environmental monitoring and everyday medical diagnostics. 

The postdoc will work either (A) on technical extensions based on k-mer algorithmics that will improve the scalability and flexibility of the approach described above, or (B) on the inclusion of new functionalities that may help answer further biological questions, additional to phylogenetic placement. Below we describe some ideas for future work: points 1 and 2 fall roughly under objective A, whereas points 3 and 4 fall roughly under B. The postdoc will be free to form her/his own opinion about the issues identified below, prioritize them, propose solutions for some of them, and implement these solutions

1. Simple data structures are currently used in our program. This can be improved. In order to store the AkDB (currently a hash table), various alternatives for indexing can be contemplated. Memory usage, speed of interrogations, and the desired flexibility of the interrogations (see point 2 below) will be the key factors to consider. Moreover, to improve the robustness of our method to the choice of k, the AkDB could be designed to store k-mers of several different lengths k. 

2. When looking for matches between the query k-mers and the AkDB, our method only allows exact matches, meaning that if we seek a k-mer that is not stored in the AkDB, no record will be returned, even if very similar k-mers are present in the AkDB. Similarly, indels in the reference genome alignment, or in the query MSs are currently problematic for our approach. To solve these issues, an alternative to the current approach is approximate pattern matching (used in bioinformatics, e.g., to align short reads to a genome [14]). 

3. Once a phylogenetic placement for a query MS is identified, our program can also return a mapping of the query to the reference alignment in which each k-mer is placed in the genomic position that maximises its posterior probability at the placement edge. We believe that a biological signal may be extracted from this mapping. The main advantage of the alignments that can be produced in this way is that complex rearrangements can be detected and represented. 

4. When different portions of the same query MS show evidence for different phylogenetic placements, or when a metagenome from a single species shows a concentration of placements on two or more edges of the reference tree, this may be taken as evidence for recombination within or among the query sequences [15], which can be taken as evidence for a recombinant genome. Recombination detection would be an important novel biological application of our method, particularly relevant to viruses, where recombination is widespread and has important epidemiological and evolutionary consequences [16,17]. 

Collaborations. In the past few years we have established a number of collaborations with researchers that are applying our placement method to a number of specific problems. The postdoc will have the opportunity to work directly with our collaborators.

At the CEV in Leuven, Belgium our approach is used for the detection and identification of human pathogenic viruses, such as HIV (at the basis of AIDS) and HCV (causing Hepatitis C). For these viruses, it is often important to precisely identify the type/subtype of the virus infecting a patient, an information which may indicate greater risk of transmission or faster disease progression [20], and help choose among treatment options [21]. Similarly, detecting HIV recombinations is the base to recognize novel circulating recombinant forms (CRFs) [22]. The collaboration with Anne-Mieke Vandamme and her group “Clinical and Evolutionary Virology”, established in the framework of VIROGENESIS, has led to the integration of our method into virus typing tools that are being developed by the bioinformaticians in Leuven. 

At the BGPI, the team of Philippe Roumagnac is involved in the characterization of the viral diversity in a number of agro-ecosystems, such as the shrubland (fynbos) in the region of the Western Cape in South Africa, the three agro-climatic zones of Burkina Faso, the wetlands in Camargue, and the humid tropical environment of Réunion Island. Inventorying and characterizing the plant-associated viruses naturally occurring in these environments is the first step towards better understanding the emergence of diseases affecting cultivated plants in these regions, and by extension reduce the viral disease burden on crop yields. Many diseases harmful to agriculture are in fact caused by infections of pathogens originating from the environment surrounding the cultivated fields [23]. While these pathogens may be harmless in the original environments, they may cause significant damage to agricultural plants. We note that plant pathogens are a great threat to global food security [24], causing the loss of up to 16% of the global harvest annually [25]. 

For both of the above partners, a crucial interest lies in the detection of recombinant viruses. Viral recombination is the process leading to the emergence of viruses whose genome combines elements of RNA/DNA from different viral strains. Recombination plays a key role in the evolution of the virulence of a virus [17], and in its adaptation to particular hosts and host tissues (host tropism) [16]. As we have explained above, our approach is well-adapted to detecting MSs that show evidence of recombination. The BGPI is currently collaborating on this subject with Darren Martin of the University of Cape Town (South Africa), who is the author of RDP [15] (Recombination Detection Program), the leading software for recombination detection in the world. The postdoc will be strongly encouraged to exploit this collaboration.


1. Baird DJ, Hajibabaei M. Biomonitoring 2.0: a new paradigm in ecosystem assessment made possible by next‐generation DNA sequencing. Mol Ecol. 2012;21: 2039–2044.

2. Gilbert MTP. Documenting DNA in the dust. Mol Ecol. 2017;26: 969–971.

3. Zaaijer S, Columbia University Ubiquitous Genomics, Erlich Y. Using mobile sequencers in an academic classroom. eLife. 2016;5. doi:10.7554/eLife.14258

4. Sedlar K, Kupkova K, Provaznik I. Bioinformatics strategies for taxonomy independent binning and visualization of sequences in shotgun metagenomics. Comput Struct Biotechnol J. 2017;15: 48–55.

5. Ondov BD, Treangen TJ, Melsted P, Mallonee AB, Bergman NH, Koren S, et al. Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol. 2016;17: 132.

6. Butel M-J. Probiotics, gut microbiota and health. Med Mal Infect. 2014;44: 1–8.

7. Elena SF, Fraile A, García-Arenal F. Evolution and Emergence of Plant Viruses. Adv Virus Res. 2014;88: 161-191.

8. Matsen FA, Kodner RB, Armbrust EV. pplacer: linear time maximum-likelihood and Bayesian phylogenetic placement of sequences onto a fixed reference tree. BMC Bioinformatics. 2010;11: 538.

9. Berger SA, Krompass D, Stamatakis A. Performance, accuracy, and Web server for evolutionary placement of short sequence reads under maximum likelihood. Syst Biol. 2011;60: 291–302.

10. Huson DH, Mitra S, Ruscheweyh H-J, Weber N, Schuster SC. Integrative analysis of environmental sequences using MEGAN4. Genome Res. 2011;21: 1552–1560.

11. Allman ES, Rhodes JA, Sullivant S. Statistically Consistent k-mer Methods for Phylogenetic Tree Reconstruction. J Comput Biol. 2017;24: 153–171.

12. Marçais G, Pellow D, Bork D, Orenstein Y, Shamir R, Kingsford C. Improving the performance of minimizers and winnowing schemes. Bioinformatics. 2017;33: i110–i117.

13. Wood DE, Salzberg SL. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol. 2014;15: R46.

14. Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009;10: R25.

15. Martin DP, Lemey P, Lott M, Moulton V, Posada D, Lefeuvre P. RDP3: a flexible and fast computer program for analyzing recombination. Bioinformatics. 2010;26: 2462–2463.

16. He C-Q, Xie Z-X, Han G-Z, Dong J-B, Wang D, Liu J-B, et al. Homologous recombination as an evolutionary force in the avian influenza A virus. Mol Biol Evol. 2009;26: 177–187.

17. Worobey M, Rambaut A, Holmes EC. Widespread intra-serotype recombination in natural populations of dengue virus. Proc Natl Acad Sci USA. 1999;96: 7352–7357.

18. Gambette P, van Iersel L, Jones M, Lafond M, Pardi F, Scornavacca C. Rearrangement moves on rooted phylogenetic networks. PLoS Comput Biol. 2017;13: e1005611.

19. Pardi F, Scornavacca C. Reconstructible Phylogenetic Networks: Do Not Distinguish the Indistinguishable. PLoS Comput Biol. 2015;11: e1004135.

20. Bhargava M, Cajas JM, Wainberg MA, Klein MB, Pant Pai N. Do HIV-1 non-B subtypes differentially impact resistance mutations and clinical disease progression in treated populations? Evidence from a systematic review. J Int AIDS Soc. 2014;17: 18944.21.

21. Häggblom A, Svedhem V, Singh K, Sönnerborg A, Neogi U. Virological failure in patients with HIV-1 subtype C receiving antiretroviral therapy: an analysis of a prospective national cohort in Sweden. Lancet HIV. 2016;3: e166–174.

22. Ng KT, Ong LY, Takebe Y, Kamarulzaman A, Tee KK. Genome sequence of a novel HIV-1 circulating recombinant form 54_01B from Malaysia. J Virol. 2012;86: 11405–11406.

23. Jacques M-A, Arlat M, Boulanger A, Boureau T, Carrère S, Cesbron S, et al. Using Ecology, Physiology, and Genomics to Understand Host Specificity in Xanthomonas. Annu Rev Phytopathol. 2016;54: 163–187.

24. Strange RN, Scott PR. Plant Disease: A Threat to Global Food Security. Annu Rev Phytopathol. 2005;43: 83–116.

25. Chakraborty S, Newton A. Climate change, plant diseases and food security: an overview. Plant Pathol. 2011;60: 2–14.