I develop tools to detect structural variations in human genomes using different sequencing technologies. Structural variations are DNA variations that affect more than 50 bases. They are called "structural" because variations affecting a large number of bases may change the structure of a protein. Detecting structural variations is an important bioinformatics topic because these variations may lead to genetic diseases such as cancer. I have devoted nearly a decade to this area, and here I list a few tools that I developed or helped develop.
Cancer is by nature heterogeneous. After a normal cell gains a mutation (a change in DNA that may lead to cancer, denoted as the yellow diamond in this figure), some of its daughter cells may gain additional mutations (denoted as the red triangle) whereas others do not. This process continues until doctors sequence the patient's genome, which can be understood as taking a snapshot of the patient's DNA. Since the evolutionary tree has different branches, each with its own signature of mutations, the snapshot also contains multiple clones of cancerous cells. My goal as a researcher in bioinformatics and computational biology is to characterize all subclones in a cancer and to recover its evolutionary history. There are two ways to tackle this problem, and which one applies depends on the data available.
The first approach deconvolves the mixture of subclones from bulk sequencing data, in which all subclones are mixed together. BreakDown analyzes such data using the variant allele fraction (VAF) of structural variations. Since mutations from the same clone have the same VAF, and the VAF of the mutations occurring in a parent clone is the sum of those in its daughter clones, it is possible to cluster all mutations by their VAFs, infer the number of clones and the evolutionary history, and place the mutations on the edges of the tree. Traditionally, the VAF of single nucleotide variants is used for such clustering. In BreakDown, however, my colleagues and I found that structural variations are more suitable for this task because they involve more genomic fragments and are therefore more sensitive to small clones. We published this paper in BMC Bioinformatics in 2014, and we noticed a recent citation in Nature Communications (11:730, 2020) that discussed BreakDown in detail.
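The clustering idea above can be illustrated with a small Python sketch. It is not BreakDown's actual algorithm: the gap-based clustering, the tolerance values, the function names and the toy VAFs are all assumptions made for illustration. It groups mutations with similar VAFs and then looks for a cluster whose mean VAF is roughly the sum of two others, which would suggest a parent clone with two daughter clones.

```python
from statistics import mean


def cluster_vafs(vafs, max_gap=0.05):
    """Group VAFs into clusters; a gap larger than max_gap starts a new cluster.

    max_gap is a hypothetical tuning parameter, not a value from BreakDown.
    """
    ordered = sorted(vafs)
    clusters = [[ordered[0]]]
    for v in ordered[1:]:
        if v - clusters[-1][-1] > max_gap:
            clusters.append([v])
        else:
            clusters[-1].append(v)
    return clusters


def find_parent_child_triples(means, tol=0.03):
    """Return (parent, child_a, child_b) where parent is about child_a + child_b."""
    triples = []
    n = len(means)
    for p in range(n):
        for a in range(n):
            for b in range(a + 1, n):
                if p in (a, b):
                    continue
                if abs(means[p] - (means[a] + means[b])) <= tol:
                    triples.append((means[p], means[a], means[b]))
    return triples


# Toy VAFs (made up): three groups of mutations around 0.5, 0.3 and 0.2.
toy_vafs = [0.49, 0.51, 0.50, 0.31, 0.29, 0.30, 0.19, 0.21, 0.20]
clusters = cluster_vafs(toy_vafs)
means = [round(mean(c), 3) for c in clusters]
triples = find_parent_child_triples(means)

print("clusters:", clusters)
print("cluster means:", means)
print("parent/daughter candidates:", triples)
```

On this toy input the mutations fall into three clusters, and the cluster near 0.5 is flagged as a candidate parent of the clusters near 0.2 and 0.3. Real data would need a statistical clustering method and a noise model for the read counts, neither of which this sketch attempts.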
The second approach infers the mutations in single cells and the corresponding evolutionary history of those cells. Single-cell sequencing is a relatively new technology which, unlike bulk sequencing, can sequence one cell at a time. This ability to separate cells makes it possible to detect the mutations in each cell. If the cells are treated as the leaves of a tree, one can then infer the evolutionary tree from the mutation profiles at the leaves. Quite a few studies have used single nucleotide variants to recover the tree (see SCITE, OncoNEM, SiFit, SiCloneFit, SCG and BEAM). Their biggest difference is the evolutionary model each tool assumes. The models include the infinite-site model (which assumes parsimony), the finite-site model (which allows back mutations, parallel mutations and multiple mutations at the same site) and, in between these two, the Dollo model. Few studies have used copy number aberrations (CNAs) to infer the evolutionary history of cancer. Since CNAs play an important role in cancer progression, my postdoctoral work focuses on CNAs in single-cell sequencing data.
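To make the infinite-site model concrete, here is a minimal Python sketch of the classical three-gamete test on a binary cell-by-mutation matrix. It is not taken from any of the tools listed above; the function name and the toy matrices are hypothetical. It assumes noise-free data in which 0 is the ancestral (unmutated) state, so it ignores the false positives, false negatives and missing entries found in real single-cell data. Under these assumptions, a pair of mutations is consistent with the infinite-site model only if the patterns (1, 0), (0, 1) and (1, 1) do not all appear among the cells.

```python
from itertools import combinations


def infinite_site_conflicts(matrix):
    """Return pairs of mutation columns that fail the three-gamete test.

    matrix[c][m] is 1 if cell c carries mutation m, else 0.
    An empty result means the matrix fits a tree in which every
    mutation arises exactly once and is never lost.
    """
    n_mutations = len(matrix[0])
    conflicts = []
    for i, j in combinations(range(n_mutations), 2):
        patterns = {(row[i], row[j]) for row in matrix}
        if {(1, 0), (0, 1), (1, 1)} <= patterns:
            conflicts.append((i, j))
    return conflicts


# Toy data (made up): rows are cells, columns are mutations.
compatible = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
    [1, 1, 0],
]
conflicting = [
    [1, 1],
    [1, 0],
    [0, 1],
]

print(infinite_site_conflicts(compatible))
print(infinite_site_conflicts(conflicting))
```

The first matrix yields no conflicts, while the second reports the pair (0, 1). Such a conflict is the kind of pattern that a finite-site or Dollo model would explain with a back mutation, a parallel mutation or a loss.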
The following sections give more details on my work on each approach to deciphering cancer heterogeneity.