Research

Here are some of my recent research topics, to give a flavour of my interests.

1. Computational methods

1a. LAST

I have spent (wasted?) a lot of time developing methods to find and align related DNA sequences. When I hesitantly started this in 2007, it seemed like the oldest, deadest possible research topic. But it's really fundamental, and surprisingly open-ended.

My starting point was to understand why it is hard (the dominant factor is that we get overwhelmed by massively repeated elements), and what we want (usually a few top hits for each part of each "query" sequence). This led to LAST, which efficiently finds the top m hits at each query position, and so avoids getting swamped by repeats.
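As a rough illustration of the "top m hits" idea (a toy sketch, not LAST's actual implementation): one way to avoid being swamped by repeats is to lengthen the match that starts at each query position until it occurs at most m times in the reference. The Python below does this with a naive suffix array. The function names and the tiny example sequences are made up for illustration.

```python
def build_suffix_array(ref):
    """Start positions of all suffixes of ref, in sorted order (naive, for toy inputs)."""
    return sorted(range(len(ref)), key=lambda i: ref[i:])


def suffix_range(ref, sa, pattern):
    """Half-open range [start, end) of suffix-array entries that begin with pattern."""
    k = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # lower bound
        mid = (lo + hi) // 2
        if ref[sa[mid]:sa[mid] + k] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:  # upper bound
        mid = (lo + hi) // 2
        if ref[sa[mid]:sa[mid] + k] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return start, lo


def rare_matches(ref, query, m):
    """For each query position, the shortest match occurring at most m times in ref.

    Returns {query_position: (match_length, sorted reference positions)}.
    Positions with no sufficiently rare match are omitted.
    """
    sa = build_suffix_array(ref)
    hits = {}
    for q in range(len(query)):
        for length in range(1, len(query) - q + 1):
            start, end = suffix_range(ref, sa, query[q:q + length])
            count = end - start
            if count == 0:
                break  # longer matches cannot exist either
            if count <= m:
                hits[q] = (length, sorted(sa[start:end]))
                break
    return hits


# Hypothetical toy input: "ac" is a repeat, so matches must grow past it.
hits = rare_matches("acacacacgtt", "acgt", m=1)
print(hits)
```

In this toy case the repeated "ac" forces the first query position to extend to "acg" before the match becomes unique, which is the behaviour that keeps repeats from dominating the output.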

LAST had the misfortune to coincide with an explosion of "DNA read mapping" methods, which rather overshadowed it. But it was always different: LAST builds on previous best practices (e.g. BLAST), with statistical models of substitution and gap frequencies, E-values, spaced/subset seeds, etc., to find similarities with arbitrary divergence and length. The LAST philosophy is that all kinds of alignment (short reads, long reads, whole genomes, proteins) follow the same statistical and algorithmic principles.

My favorite LAST feature is split alignment, which aligns each part of each query to a unique best place in the reference, allowing different parts of one query to match completely different places. Because it is statistical model-based, it can report the probability (i.e. confidence / unambiguity) of each alignment part. It can find arbitrarily complex sequence rearrangements, such as those caused by MMBIR (microhomology-mediated break-induced replication) in cancer. It's also good for chimeric sequences, e.g. if different parts of one query come from different viral strains.
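To give a feel for the "unique best place for each part of the query" idea, here is a deliberately simplified sketch. It is not LAST's algorithm and ignores the probabilistic model entirely: it just picks, from a list of candidate alignments, the set that does not overlap in the query and has the highest total score (classic weighted interval scheduling). The candidate tuples, reference locations, and scores are hypothetical.

```python
from bisect import bisect_right


def choose_split_alignment(candidates):
    """Pick query-non-overlapping candidates with maximum total score.

    Each candidate is (query_start, query_end, ref_location, score),
    with half-open query intervals. Returns (total_score, chosen_list).
    """
    cands = sorted(candidates, key=lambda c: c[1])  # sort by query end
    ends = [c[1] for c in cands]
    best = [0] * (len(cands) + 1)  # best[i]: best total using the first i candidates
    take = [False] * len(cands)
    prev = []
    for i, (q_start, _q_end, _loc, score) in enumerate(cands):
        # Number of earlier candidates that end at or before this one starts.
        p = bisect_right(ends, q_start, 0, i)
        prev.append(p)
        if best[p] + score > best[i]:
            best[i + 1] = best[p] + score
            take[i] = True
        else:
            best[i + 1] = best[i]
    # Trace back through the table to recover the chosen candidates.
    chosen = []
    i = len(cands)
    while i > 0:
        if take[i - 1]:
            chosen.append(cands[i - 1])
            i = prev[i - 1]
        else:
            i -= 1
    chosen.reverse()
    return best[-1], chosen


# Hypothetical candidates for one 100-base query.
candidates = [
    (0, 50, "chr1:1000", 40),
    (40, 100, "chr7:5000", 55),
    (50, 100, "chr7:5010", 48),
    (0, 100, "chr3:200", 70),
]
total, chosen = choose_split_alignment(candidates)
print(total, [c[2] for c in chosen])
```

Here the two halves of the query are assigned to different chromosomes because together they score higher than the single full-length candidate. A model-based method can additionally say how confident each part is, which this sketch cannot.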

1b. tantan

Sequence comparison usually aims to find homologies, i.e. sequences that share a common ancestor. This is confounded by "simple sequences" such as:

tatatacatatgtgtgtgtgtgtatatatatatacacacacacacatatatatgta

Or:

gtttatgattacaaaaataaaataaaaaaattaggtattaaattataactgtaaaa

Simple sequences arise frequently, and non-homologous ones can be highly similar (by the usual statistical measures that regard bases as independent of their neighbors). This is a very fundamental and classic problem, so there are old methods to find and "mask" simple sequences. My younger, naive self was very surprised to discover that these standard methods do not work! That is, they do not reliably prevent false homologies due to simple sequences. By considering how simple sequences evolve (DNA polymerase slippage), I developed a statistical model-based method, tantan, which does seem to work reliably.
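A crude way to see what makes these sequences "simple" (this is only a diagnostic sketch, not tantan's model): measure how often each base equals the base a short fixed distance before it. Sequences shaped by slippage tend to score highly at some short period. The function names are made up for illustration.

```python
def repeat_fraction(seq, period):
    """Fraction of positions i >= period where seq[i] == seq[i - period]."""
    n = len(seq) - period
    if n <= 0:
        return 0.0
    matches = sum(1 for i in range(period, len(seq)) if seq[i] == seq[i - period])
    return matches / n


def best_period(seq, max_period=10):
    """Shortest period (1..max_period) with the highest repeat fraction."""
    return max(range(1, max_period + 1), key=lambda k: repeat_fraction(seq, k))


# The first example sequence from above.
simple = "tatatacatatgtgtgtgtgtgtatatatatatacacacacacacatatatatgta"
for k in range(1, 5):
    print(k, round(repeat_fraction(simple, k), 2))
```

A per-period score like this treats the whole sequence as one unit and has no notion of where a repeat starts or ends, which is one reason a proper statistical model is needed for masking.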

Reflection

It seems I have ended up working on old and classic topics. I did not set out to do that, but I think it's OK, because these topics are very fundamental, and it turned out that the previous methods were not the final word.

2. Biology

2a. Promoter properties

In the FANTOM5 project, I tried to clarify our understanding of mammalian promoters and transcription start sites. These are very fundamental, and many studies have examined features such as CpG islands, TATA boxes, expression breadth (whether a promoter is expressed in a broad range of cell types), etc. But it's all quite confusing, and I wanted to understand why these features exist and vary as they do.

First, I found that we must be very careful how we measure some of these features, to avoid statistical biases and wrong conclusions. In particular, most promoters are expressed quite broadly in many cell types, which is the opposite of what many experts seem to believe. Perhaps this should not be surprising, because humans have no more genes than simple animals with fewer cell types, such as worms. This means that most human cell types are not determined by expressing hundreds of unique genes: instead they must be determined by more subtle changes in expression level, or perhaps by which genes they do not express.

Secondly, by considering direct versus indirect correlations between promoter properties, we can understand them better. In particular, CpG islands may be a mostly non-functional consequence of expression (thus lower methylation and mutation of CG dinucleotides) in germ line cells. This correlates "spuriously" with expression breadth, because broadly-expressed promoters are more likely to be active in any given cell type, including germ cells.
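One standard way to separate direct from indirect correlation is partial correlation: correlate two properties after accounting for a third. The sketch below is only an illustration of that general technique, not necessarily the analysis used in the project, and the numbers are made-up toy values (not FANTOM5 data) constructed so that "CpG content" and "breadth" are related only through "germ line expression".

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def partial_correlation(x, y, z):
    """Correlation of x and y after removing the linear effect of z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / sqrt((1 - rxz ** 2) * (1 - ryz ** 2))


# Toy values for eight hypothetical promoters (invented for illustration).
germline = [0, 0, 0, 0, 1, 1, 1, 1]
cpg = [0.5, -0.5, 0.5, -0.5, 1.5, 0.5, 1.5, 0.5]      # germline + independent noise
breadth = [0.5, 0.5, -0.5, -0.5, 1.5, 1.5, 0.5, 0.5]  # germline + other noise

print("raw correlation:    ", pearson(cpg, breadth))
print("partial correlation:", partial_correlation(cpg, breadth, germline))
```

In this constructed example the raw correlation between CpG content and breadth is clearly positive, but it vanishes once germ line expression is accounted for, which is the pattern expected of an indirect ("spurious") correlation.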

2b. Human / chimp genome rearrangements (unpublished)

To show off LAST's split alignment, I decided to survey rearrangements in the human genome relative to chimps and other apes. Human-chimp genome comparison is (again!) fundamental, and previous studies have examined substitutions, deletions, duplications, and inversions, but, surprisingly, not other kinds of rearrangement. I guess they found it too difficult, partly because the published ape genomes are incomplete and messy.

With LAST, we can align rearranged orthologous regions more accurately than before. I found that many rearrangements reflect shattering of DNA into multiple fragments, which then rejoined in random order and orientation, with some fragments lost. The only plausible cause I can think of is natural radiation, e.g. from radon gas or cosmic rays. Occasionally, sister (or homologous) chromosomes get shattered simultaneously, which can produce duplications in the rejoined molecule. I also found rearrangements due to non-allelic homologous recombination, but, to my surprise, none clearly caused by aberrant DNA replication, such as MMBIR.

Of course, I hoped to find rearrangements that affect genes for brain size or the like, but no such luck.
