PacBio HiFi reads are now changing the world of genome assembly. Their combination of length and accuracy is a powerful tool that enabled the first assemblies of centromeres in the human genome. Existing algorithms for HiFi reads are mostly inherited from long-read assemblers that rely on overlap graphs. However, the accuracy of HiFi reads enables the application of de Bruijn graph based methods. My goal is to create assembly algorithms for HiFi reads based on the de Bruijn graph.
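As a rough illustration of the data structure involved, the following minimal sketch builds a de Bruijn graph from a handful of error-free reads and reports its branching nodes, which is where repeats show up. The function names, the toy reads, and the choice k = 4 are assumptions made for this example only; the sketch ignores reverse complements and sequencing errors, and it is not the assembly algorithm of this project.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Each k-mer is an edge from its prefix to its suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


def branching_nodes(graph):
    """Nodes with more than one outgoing edge (candidate repeat boundaries)."""
    return sorted(node for node, succ in graph.items() if len(succ) > 1)


# Toy, error-free reads (hypothetical data).
reads = ["ACGTACGA", "GTACGATT"]
graph = build_de_bruijn(reads, k=4)

print(sorted(graph["ACG"]))    # ['CGA', 'CGT']
print(branching_nodes(graph))  # ['ACG']
```

Here the 3-mer ACG occurs in two different contexts, so it becomes a branching node. Exact k-mer matching of this kind is only practical when reads are accurate, which is the property of HiFi reads that the paragraph above relies on.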
PacBio and ONT reads are extremely powerful tools for repeat resolution because their length allows one to bridge most repeats and thus connect entrances to the corresponding exits. However, repeats longer than the read length cannot be bridged. This project is devoted to resolving such repeats based on divergences between repeat copies. We have already succeeded in resolving many complex repeats in bacterial datasets generated by the Sanger Institute as part of the NCTC sequencing project, and we aim to apply our methods to metagenomes and mammalian genomes.
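The idea of separating repeat copies by their divergent positions can be sketched on a toy example. The code below assumes that reads from the repeat are already aligned to each other without indels and have equal length; it treats a column as divergent when at least two different bases are each supported by `min_support` reads, then groups reads by their bases at those columns. The function names, the threshold, and the reads are hypothetical, and this simplification is not the method used in the project.

```python
from collections import Counter, defaultdict


def divergent_positions(reads, min_support=2):
    """Columns where at least two different bases each occur in >= min_support reads."""
    positions = []
    for col in range(len(reads[0])):
        counts = Counter(read[col] for read in reads)
        supported = [base for base, n in counts.items() if n >= min_support]
        if len(supported) >= 2:
            positions.append(col)
    return positions


def split_by_copy(reads, min_support=2):
    """Group reads by their bases at the divergent columns."""
    positions = divergent_positions(reads, min_support)
    groups = defaultdict(list)
    for read in reads:
        signature = "".join(read[p] for p in positions)
        groups[signature].append(read)
    return positions, dict(groups)


# Two repeat copies differing at columns 2 and 6; the last read carries a
# single sequencing error at column 4 (hypothetical data).
reads = ["ACGTTGCA", "ACGTTGCA", "ACCTTGGA", "ACCTTGGA", "ACGTAGCA"]
positions, groups = split_by_copy(reads)

print(positions)                                  # [2, 6]
print({sig: len(r) for sig, r in sorted(groups.items())})  # {'CG': 2, 'GC': 3}
```

The isolated error at column 4 is ignored because only one read supports it, while the two true differences separate the reads into two groups, one per repeat copy. An error that falls exactly on a divergent column would create a spurious group in this sketch, so a realistic implementation would need a more robust clustering step.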
The Vertebrate Genomes Project (VGP) succeeded in generating comprehensive assemblies of multiple vertebrate species. However, many complex but biologically important regions remain fragmented in these assemblies. Among such regions are the immunoglobulin (Ig) loci, which often contain very long repeats harboring many genes that regulate the immune system, including the V, D, and J genes that are the building blocks of antibodies. Only a very accurate assembly of the immunoglobulin loci can reveal all these genes and enable a large-scale comparative analysis of immunity in vertebrate species. We have already shown that even existing PacBio reads can be used to assemble the immunoglobulin heavy chain locus in the stoat genome. My goal is to revisit the assembly of each genome published by the VGP and resolve repeats in Ig loci using divergences between repeat copies.
Our collaborators from the T2T consortium sequenced a human genome at high coverage with ultra-long PacBio and ONT reads as well as PacBio HiFi reads, and assembled them using a combination of automatic assembly tools (Canu, Flye, HiCanu) and manual analysis based on additional sequencing technologies (10X, Hi-C, and optical mapping). As a result, the T2T consortium generated by far the most contiguous and accurate human genome assembly, including centromeres and telomeres that had resisted assembly efforts until now. However, the automated assembly of all human chromosomes remains an open problem that is critical for scaling the T2T project to thousands of genomes. My goal is to resolve the remaining repeats in the human genome based on the divergences between repeat copies.
Since assembling long repetitive DNA fragments is a challenging problem, 16S rRNA genes are rarely resolved in metagenomic datasets. I aim to reconstruct accurate 16S rRNA genes using the divergence between 16S rRNAs of different bacterial species. This project is a collaboration with Rob Knight's lab at UCSD and is part of an effort to analyze the human microbiome and to sequence new candidate phyla (“dark bacterial matter”) by analyzing various metagenomic datasets.
Many long bacterial genes are split between several contigs during metagenomic assembly and are therefore ignored by gene prediction algorithms that take contigs as input. However, these genes still form paths in the de Bruijn graph, and with a little knowledge of their structure we can extract these lost potential genes from the graph. We have already succeeded in reconstructing many novel CRY genes, which are very important for agriculture as natural pesticides, and we aim to apply this method to other gene families such as CRISPR.
This project is a collaboration with Keith Turner from Monsanto Company.
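The observation that a gene split across contigs still forms a path in the de Bruijn graph can be illustrated with a small sketch. The example below assumes, purely for illustration, that the "little knowledge of structure" is a pair of known anchor (k-1)-mers at the start and end of the gene; it builds a graph from three toy contigs and enumerates the sequences spelled by simple paths between the anchors. All names, the anchors, the length bound, and the contigs are hypothetical, and this is not the extraction method used in the project.

```python
from collections import defaultdict


def build_de_bruijn(seqs, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some sequence."""
    graph = defaultdict(set)
    for seq in seqs:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


def spell_paths(graph, start, end, max_len):
    """Sequences spelled by simple paths from node `start` to node `end`."""
    results = []
    stack = [(start, start, {start})]  # (current node, spelled sequence, visited nodes)
    while stack:
        node, seq, seen = stack.pop()
        if node == end:
            results.append(seq)
            continue
        if len(seq) >= max_len:
            continue  # give up on paths that grow beyond the length bound
        for nxt in sorted(graph.get(node, ())):
            if nxt not in seen:
                stack.append((nxt, seq + nxt[-1], seen | {nxt}))
    return sorted(results)


# A toy gene split over two contigs at a branching node, plus a third contig
# that causes the branch (hypothetical data).
contigs = ["ATGCCGTT", "GTTCAACTGA", "GTTGGA"]
graph = build_de_bruijn(contigs, k=4)

genes = spell_paths(graph, start="ATG", end="TGA", max_len=20)
print(genes)  # ['ATGCCGTTCAACTGA']
```

Neither of the first two contigs contains the full 15-base sequence, so a contig-based gene predictor would miss it, yet the path between the two anchors recovers it from the graph. On real metagenomic graphs the number of paths can grow quickly, so a practical approach would need stronger constraints than a length bound.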