Ph.D. Research Project

My main project at Prof. Pavel Pevzner's lab is dedicated to genome assembly from long error-prone reads, such as those produced by PacBio and Oxford Nanopore sequencers. Specifically, I work on the assembly of tandem repeats: regions of the genome where a certain pattern is repeated several times (in some cases hundreds or even thousands of times) with limited variation. Extra-long tandem repeats (ETRs) are the most challenging for assembly algorithms and range in length from tens of thousands to millions of nucleotides. With very limited diversity between repeat copies, the assembly resembles a giant puzzle of a clear sky with very few cirrus clouds!


Even though the first draft assembly of the human genome dates back almost 20 years, hundreds of gaps persist in the modern reference assembly. The unassembled regions include ETRs and very long segmental duplications. Human centromeres and immunoglobulin loci are examples of functionally important regions that remain partially or fully unassembled.


Centromeres are among the longest tandem repeats in the human genome and form the biggest gaps in the reference human genome assembly. Centromeres play critical roles in chromosome segregation, and a large component of genetic disease results from aneuploidies arising during meiosis. In addition, variations in centromeres are linked to cancer and infertility.


Immunoglobulin loci are responsible for the biosynthesis of antibodies, the key components of our immune system that protect us from various pathogenic threats. Antibodies can be used to design drugs targeting cancer and other diseases. Although the development of such drugs involves many computational bottlenecks, until recently there was no software for antibody sequencing and there were no algorithms for analyzing immunogenomics data. Computational immunogenomics emerged only in 2011 at the border of immunology and computing and is one of the fastest-growing disciplines in the life sciences.


To close all remaining gaps in the human genome, the recently established Telomere-to-Telomere (T2T) consortium aims to generate the first complete assembly of the human genome. The consortium's generation of the first complete assembly of a human chromosome was an important stepping stone toward studying the as-yet-unexplored variation in complex repetitive regions and its consequences. I recently joined the T2T consortium, and my goal is to tackle all unassembled ETRs in the human genome and assess their variability population-wide.


We just submitted a manuscript describing centroFlye, the first algorithm for centromere assembly from long error-prone reads. We applied this tool to automatically reconstruct the centromere of chromosome X and provided evidence that the previous manual reconstruction of centromere X by the T2T consortium missed a large fraction of the centromere, resulting in errors that were fixed in our new assembly. Analysis of the centroFlye assembly revealed that the centromere of the human X chromosome is partitioned into well-defined repeat subfamilies and provided initial insights into centromere evolution.