Research

Direction 1: Structural variation detection

I develop tools that detect structural variations in human genomes using different sequencing technologies. Structural variations are DNA variations that affect more than 50 bases. They are called "structural" because variations affecting a large number of bases may change the structure of a protein. Detecting structural variations is an important topic in bioinformatics because these variations may lead to genetic diseases such as cancer. I have devoted nearly a decade to this area; below are a few tools that I developed or helped develop.

  • HySA - combining Illumina and PacBio reads to detect insertions and deletions; indicating the optimal coverage combination of the two sequencing technologies for best sensitivity and specificity. The tool is available on Bitbucket.
  • OMIndel - using optical maps to detect large insertions and deletions; showing that optical maps complement existing sequencing-based technologies in 1) complex events and 2) repetitive regions. The tool is available on Bitbucket.
  • BreakDancer - utilizing discordant read pairs in Illumina data to detect deletions, insertions, inversions, and intra- and inter-chromosomal translocations. The tool is available on Bitbucket. A newer version is on GitHub under the maintenance of the McDonnell Genome Institute. (I was responsible for refactoring the Perl code into C++ and using APIs to make the program more stable and, through parallelization, much faster.)
  • TIGRA-SV - given putative structural variations, locally assembling Illumina split reads to ascertain the presence of breakpoints and to improve their resolution to the base-pair level. The tool is available on Bitbucket. A newer version is on GitHub under the maintenance of the McDonnell Genome Institute. (I was responsible for refactoring the Perl code into C++ and using APIs to make the program more stable and, through parallelization, much faster.)
  • CREST - using split reads to detect structural variations. (I was responsible for testing the program.)
  • novoBreak - doing de novo local assembly by pulling together all split reads; this tool won Dream Challenge 8.5. (I took part in debugging the code during the Dream Challenge.)
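
To illustrate the discordant-read-pair idea mentioned for BreakDancer, here is a minimal Python sketch. It is not BreakDancer's code: the `ReadPair` type, the `classify_pair` function, and the cutoff of three standard deviations are hypothetical names and values chosen for illustration, and the sketch assumes a standard Illumina paired-end library in which a normal pair maps to the same chromosome, on opposite strands, at roughly the library's insert size.

```python
from dataclasses import dataclass


@dataclass
class ReadPair:
    """Mapped positions of the two reads in a pair (hypothetical structure)."""
    chrom1: str
    pos1: int
    strand1: str  # "+" or "-"
    chrom2: str
    pos2: int
    strand2: str  # "+" or "-"


def classify_pair(pair, mean_insert, sd_insert, n_sd=3.0):
    """Label a read pair by the kind of event it would hint at.

    n_sd is an assumed cutoff: a pair counts as discordant by distance
    when its span is more than n_sd standard deviations from the mean.
    """
    # Reads on different chromosomes suggest an inter-chromosomal event.
    if pair.chrom1 != pair.chrom2:
        return "inter-chromosomal translocation"
    # Both reads on the same strand suggest an inversion.
    if pair.strand1 == pair.strand2:
        return "inversion"
    span = abs(pair.pos2 - pair.pos1)
    # Reads mapping too far apart: sequence is missing in the sample.
    if span > mean_insert + n_sd * sd_insert:
        return "deletion"
    # Reads mapping too close together: the sample has extra sequence.
    if span < mean_insert - n_sd * sd_insert:
        return "insertion"
    return "concordant"


pair = ReadPair("chr1", 1000, "+", "chr1", 6000, "-")
label = classify_pair(pair, mean_insert=400, sd_insert=50)
```

A single discordant pair is weak evidence; in practice a caller would require several pairs supporting the same event before reporting it.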

Direction 2: Cancer heterogeneity deconvolution

Cancer is by nature heterogeneous. After a normal cell gains a mutation (a change in DNA that may lead to cancer, denoted as the yellow diamond in this figure), some of its daughter cells may gain additional mutations (denoted as the red triangle) whereas others do not. This process continues until doctors sequence the patient's genome, which can be understood as taking a snapshot of the patient's DNA. Since the evolutionary tree has different branches, each with its own signature of mutations, the snapshot also contains multiple clones of cancerous cells. My goal as a researcher in bioinformatics and computational biology is to characterize all subclones in the cancer and to recover its evolutionary history. There are two ways to tackle this problem, and which one applies depends on the data available.

The first approach is to deconvolve the mixture of subclones from bulk sequencing data, in which all subclones have been mixed together. BreakDown analyzes such data through the variant allele fraction (VAF) of structural variations. Since mutations from the same clone have the same VAF, and the VAF of the mutations occurring in a parent clone is the sum of those in its daughter clones, it is possible to cluster all mutations by their VAFs, infer the number of clones and the evolutionary history, and place the mutations on the edges of the tree. Traditionally, the VAF of single nucleotide variants is used for such clustering. However, in BreakDown, my colleagues and I found that structural variations are more suitable for this task because they involve more genomic fragments and are therefore more sensitive to small clones. We published this work in BMC Bioinformatics in 2014, and a recent paper in Nature Communications (11:730, 2020) discussed BreakDown in detail.
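
The clustering idea above can be sketched in a few lines of Python. This is an illustration only and not BreakDown's algorithm: the gap-based grouping, the function names `cluster_vafs` and `is_consistent_parent`, the tolerances, and the example VAF values are all assumptions made for the sketch. It groups mutations whose VAFs are close together, then checks whether one cluster's VAF is approximately the sum of two others, as the parent-daughter relation described above would require.

```python
def cluster_vafs(vafs, gap=0.05):
    """Group sorted VAFs, starting a new cluster when the jump exceeds `gap`.

    Returns the mean VAF of each cluster in ascending order.
    `gap` is an assumed tolerance, not a value from any published tool.
    """
    clusters = []
    for v in sorted(vafs):
        if clusters and v - clusters[-1][-1] <= gap:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    return [sum(c) / len(c) for c in clusters]


def is_consistent_parent(parent_vaf, child_vafs, tol=0.05):
    """Check that the parent VAF is approximately the sum of its children."""
    return abs(parent_vaf - sum(child_vafs)) <= tol


# Made-up VAFs for seven mutations in one bulk sample.
vafs = [0.50, 0.48, 0.52, 0.30, 0.31, 0.19, 0.20]
means = cluster_vafs(vafs)

# The highest-VAF cluster is a candidate parent of the two smaller ones.
parent_ok = is_consistent_parent(means[-1], means[:-1])
```

With these made-up values the mutations fall into three clusters, and the largest cluster's VAF is close to the sum of the other two, which is compatible with one parent clone and two daughter clones. Real data is noisier, so a probabilistic clustering model would normally replace the fixed gap.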

The second approach is to infer the mutations occurring in single cells and the corresponding evolutionary history of those cells. Single-cell sequencing is a relatively new technology which, unlike bulk sequencing, can sequence one cell at a time. This ability to separate the cells makes it possible to detect the mutations in each cell. If the cells are treated as the leaves of a tree, one can then infer the evolutionary tree from the mutation profiles at the leaves. Quite a few studies have used single nucleotide variants to recover the tree (see SCITE, OncoNEM, SiFit, SiCloneFit, SCG and BEAM). Their biggest difference is the underlying evolutionary model. The models include the infinite-sites model (assuming parsimony), the finite-sites model (allowing back mutations, parallel mutations and multiple mutations at the same site) and, in between these two, the Dollo model. Few studies have used copy number aberrations (CNAs) to infer the evolutionary history of cancer. Since CNAs play an important role in cancer progression, my postdoctoral work focuses on CNAs in single-cell sequencing data.
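
To make the infinite-sites assumption concrete, the sketch below checks whether a noise-free binary cell-by-mutation matrix is compatible with it. It uses the standard three-gamete condition for a perfect phylogeny with an unmutated root: no pair of mutation sites may show all three patterns (1,0), (0,1) and (1,1) across the cells. The function name `find_infinite_sites_violation` and the example matrices are hypothetical, and this is not how any of the tools listed above work internally; they model sequencing errors probabilistically, whereas this check assumes error-free calls.

```python
from itertools import combinations


def find_infinite_sites_violation(matrix):
    """Return the first pair of site indices that breaks the three-gamete
    condition, or None if the matrix fits the infinite-sites model.

    matrix[c][s] is 1 if cell c carries the mutation at site s, else 0.
    The root of the tree is assumed to carry no mutations.
    """
    n_sites = len(matrix[0]) if matrix else 0
    forbidden = {(1, 0), (0, 1), (1, 1)}
    for i, j in combinations(range(n_sites), 2):
        # Collect the patterns that sites i and j show across all cells.
        patterns = {(row[i], row[j]) for row in matrix}
        if forbidden <= patterns:
            return (i, j)
    return None


# Three cells, three sites: site 1 arose within the clone carrying site 0.
compatible = [
    [1, 0, 0],
    [1, 1, 0],
    [0, 0, 1],
]

# Sites 0 and 1 appear separately and together, so under an unmutated root
# one of them must have mutated twice or been lost.
conflicting = [
    [1, 0],
    [0, 1],
    [1, 1],
]

ok_result = find_infinite_sites_violation(compatible)
bad_result = find_infinite_sites_violation(conflicting)
```

A conflicting pair like the second example is exactly what the finite-sites and Dollo models are designed to accommodate, through parallel mutations or mutation loss.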

The following gives more details on my work for each approach to deciphering cancer heterogeneity.

  • BreakDown - using read depth, discordant read pairs and split reads to estimate the variant allele fraction of a structural variation; the first tool to tackle the problem from the structural variation point of view. The tool is available on Bitbucket.
  • Single-cell CNA Benchmark study - single-cell sequencing is important in resolving cancer heterogeneity. CNAs leave footprints of cancer cell growth and can be used to trace the lineage of cancer cells. Current work on single-cell CNA detection lacks a benchmark study, so the accuracy of each method is unclear. In this study, twenty-eight CNA detection methods were reviewed and categorized from a single-cell perspective, and three of them were selected for quantitative analysis on both simulated and real data. In particular, a single-cell CNA simulator was designed that mimics real cancer cell lineages and single-cell sequencing. The manuscript is now on bioRxiv (with the simulator code freely downloadable from GitHub).