PacBio HiFi reads are changing the field of genome assembly. Their combination of length and accuracy enabled the first assemblies of centromeres in the human genome. Existing algorithms for HiFi reads are mostly inherited from long-read assemblers that were optimized to handle the high error rates of PacBio CLR and nanopore reads. With HiFi data, the focus of assembly algorithms should instead be on reliably assembling complex regions that were previously considered impossible to reconstruct. We are developing a new approach to assembling HiFi reads that is rooted in the theoretical problem of optimal genome assembly and features transparency of decisions, enabling easier curation and verification of assembly results.
The Vertebrate Genomes Project (VGP) has succeeded in generating comprehensive assemblies of multiple vertebrate species. However, many complex but biologically important regions remain fragmented in these assemblies. One such class of regions is the immunoglobulin (Ig) loci, which often contain very long repeats that carry many genes regulating the immune system, including the V, D, and J genes that are the building blocks of antibodies. Only a very accurate assembly of the immunoglobulin loci can reveal all of these genes and enable large-scale comparative analysis of immunity across vertebrate species. We have already shown that even existing PacBio reads can be used to assemble the immunoglobulin heavy chain locus in the stoat genome. My goal is to revisit the assembly of each genome published by the VGP and resolve repeats in Ig loci using divergences between repeat copies.
Investigating viral genomes is inherently challenging: each sample is a mosaic of multiple viral strains, blurring the link between genotype and phenotype. This task becomes even more difficult in low-complexity regions, which are common in viral genomes and notoriously hard to resolve. Recent advances in highly accurate HiFi sequencing, however, change this landscape. HiFi reads make it possible to disentangle alleles even in these problematic regions, enabling direct investigation of their associations with virulence and other key viral traits.
PacBio and ONT reads are powerful tools for repeat resolution because their length allows them to bridge most repeats and thus connect each entrance to the corresponding exit. However, repeats longer than the read length cannot be bridged. This project is devoted to resolving such repeats using divergences between repeat copies. We have already resolved many complex repeats in bacterial datasets generated by the Sanger Institute as part of the NCTC sequencing project, and we aim to apply our methods to metagenomes and mammalian genomes.
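To make the idea of separating repeat copies by their divergences concrete, here is a minimal toy sketch. It is not the method used in this project. It assumes error-free reads whose positions inside the repeat are already known, and the function names, the `min_count` threshold, and the toy sequences are all hypothetical. Positions where two different bases are each well supported are treated as divergent, and reads that agree on every divergent position they share are grouped together.

```python
from collections import Counter, defaultdict


def diagnostic_columns(reads, min_count=2):
    """Positions where at least two different bases are each seen min_count times."""
    columns = defaultdict(Counter)
    for start, seq in reads:
        for offset, base in enumerate(seq):
            columns[start + offset][base] += 1
    return sorted(
        pos for pos, counts in columns.items()
        if sum(1 for c in counts.values() if c >= min_count) >= 2
    )


def split_reads_by_copy(reads, min_count=2):
    """Group read indices that agree on all divergent positions they share."""
    diag = diagnostic_columns(reads, min_count)
    # Signature of a read: the base it shows at each divergent position it covers.
    signatures = [
        {p: seq[p - start] for p in diag if start <= p < start + len(seq)}
        for start, seq in reads
    ]

    # Union-find over reads (assumption: no sequencing errors in this toy).
    parent = list(range(len(reads)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(reads)):
        for j in range(i + 1, len(reads)):
            shared = signatures[i].keys() & signatures[j].keys()
            if shared and all(signatures[i][p] == signatures[j][p] for p in shared):
                parent[find(i)] = find(j)

    groups = defaultdict(list)
    for i in range(len(reads)):
        groups[find(i)].append(i)
    return sorted(groups.values())


# Toy data: two repeat copies that differ at positions 3 and 8.
copy_a = "ACGTACGTACGT"
copy_b = "ACGAACGTCCGT"
# Reads of length 7, none of which spans the whole repeat.
reads = [(s, c[s:s + 7]) for c in (copy_a, copy_b) for s in (0, 2, 5)]
print(split_reads_by_copy(reads))  # reads 0-2 come from copy A, reads 3-5 from copy B
```

In this toy no read covers the full repeat, yet reads are chained together through the divergent positions they share, which is what lets the two copies be told apart. Real data would additionally require tolerating sequencing errors and uncertain read placement.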
Many long bacterial genes are split across several contigs during metagenomic assembly and are therefore missed by gene prediction algorithms that take contigs as input. However, these genes still form paths in the de Bruijn graph, and with a little knowledge of their structure we can extract these lost potential genes from the graph. We have already reconstructed many novel CRY genes, which are very important for agriculture as natural pesticides, and we aim to apply this method to other gene families such as CRISPR.
This project is a collaboration with Keith Turner from Monsanto Company.
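The observation that a gene split across contigs still forms a path in the de Bruijn graph can be illustrated with a small sketch. This is a toy under stated assumptions, not the project's actual algorithm: the "knowledge of gene structure" is reduced to a known start and end anchor, and the function names, anchors, value of `k`, and sequences are all hypothetical. A second sequence shares a stretch with the gene, creating branches at which a typical assembler would break contigs, while a bounded path search between the anchors still recovers the full gene.

```python
from collections import defaultdict


def build_de_bruijn(sequences, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some k-mer."""
    graph = defaultdict(set)
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def paths_between(graph, start, end, max_len):
    """Spell all simple paths from node `start` to node `end`, up to max_len bases."""
    results = []
    stack = [(start, start, frozenset([start]))]
    while stack:
        node, spelled, seen = stack.pop()
        if node == end:
            results.append(spelled)
            continue
        if len(spelled) >= max_len:
            continue  # give up on paths that exceed the length bound
        for nxt in sorted(graph.get(node, ())):
            if nxt not in seen:  # simple paths only: never revisit a node
                stack.append((nxt, spelled + nxt[-1], seen | {nxt}))
    return results


# Toy data (hypothetical): a short "gene" and an unrelated sequence sharing "ACCTGAAG".
gene = "ATGGCGTACCTGAAGTTCGACTAA"
other = "TTTTACCTGAAGCCCC"
k = 5
# Two overlapping reads cover the gene; the shared stretch creates branches in the graph.
graph = build_de_bruijn([gene[:14], gene[8:], other], k)

# Assumed prior knowledge: the gene starts and ends with these (k-1)-mers.
found = paths_between(graph, gene[:k - 1], gene[-(k - 1):], max_len=30)
print(found)
```

The length bound and the anchors stand in for gene-family knowledge such as expected gene length and conserved motifs. On real metagenomic graphs the search would also need to cope with sequencing errors and many more branching paths.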