I post here to catalog my recent research in evolutionary biology, from theory to tutorials to software updates. I love to hear from visitors, so feel free to contact me by email, tweet, or comment.

Fall updates

posted Nov 18, 2013, 7:15 PM by Michael Landis

It's been a busy semester! I'm studying in Durham, North Carolina under the NESCent graduate fellowship, where daily I grow fonder of the city's rugged tobacco-and-cinder aesthetic. Not so fond that I've forgotten Oakland, of course.

While here, part of my plan is to explore new biogeographic range evolution models designed for large numbers of areas (see initial work here). The data augmentation method I'm using was originally introduced to phylogenetics for use with protein evolution by Robinson et al. (2003) from Jeff Thorne's group and others. To my great fortune, Raleigh is only a short bus ride from Durham, and Jeff generously agreed to act as my research advisor for the semester.

In early September, I presented some past work on applying Lévy processes to phylogenetic inference (collaboration with Josh Schraiber and Mason Liang) at the Mathematics for an Evolving Biodiversity workshop hosted at the University of Montréal. A very stimulating conference! Mid-September, I presented on Bayesian biogeography for the Phylogenetics & Evolutionary Biology Seminar group.

With the presentations aside, I've had time to focus more on my fellowship project. With a little effort, I now have a working version of BayArea implemented in RevBayes. It's extremely easy to tinker with new dispersal models in this framework, so I expect my next biogeography inference method will be released solely under RevBayes.

Finally, I'm happy to say that Phylowood -- what I began on a bit of a whim as a "fun and relaxing" post-qualifying exam project, then matured into a Google Summer of Code project -- is now published as a Bioinformatics Application Note (link). Trevor Bedford, my GSoC mentor, was very supportive throughout the whole process, and even covered the Open Access fee.

Here's a gif to demonstrate some of the features of Phylowood:

Plenty in the pipeline to come, but that's all to report on for now!

Visualizing uncertainty in admixture graphs

posted Aug 7, 2013, 6:59 PM by Michael Landis   [ updated Sep 14, 2013, 6:52 AM ]

I've been working on a Bayesian implementation of Joe Pickrell and Jonathan Pritchard's admixture model (implemented as TreeMix). Their model extends Cavalli-Sforza and Edwards' seminal work on phylogenetic analysis, which assumed that modern populations' allele frequencies evolved according to a Brownian motion model and covary according to an underlying bifurcating tree. By permitting admixture edges, the Pickrell-Pritchard model can capture signals of gene flow in cases where a bifurcating tree describes covariance in the populations' allele frequencies poorly (which is anything but a rare occurrence in, say, humans). These admixture edges transform our beautiful (though biologically unrealistic) bifurcating tree into a less wieldy tree-like graph. While phylogeneticists have long used consensus trees to describe topological uncertainty, I couldn't find a good way to summarize uncertainty for this sort of tree-like directed acyclic graph (DAG).
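To make the tree-only assumption concrete, here is a minimal Python sketch of how a bifurcating tree induces covariance under Brownian motion: two populations covary in proportion to the branch length they share on the way down from the root. The toy tree, the unit Brownian rate, and the function names are my own assumptions for illustration; this is not how TreeMix or my implementation is coded, and it ignores admixture edges and any scaling by ancestral allele frequency.

```python
def path_to_root(node, parent):
    """Return the nodes from `node` up to (but excluding) the root."""
    path = []
    while node in parent:
        path.append(node)
        node = parent[node][0]
    return path


def bm_covariance(tips, parent):
    """Covariance of tip values under Brownian motion with unit rate.

    `parent` maps each non-root node to (its parent, length of the branch above it).
    Cov(i, j) is the summed length of the branches shared by both root-to-tip paths.
    """
    cov = {}
    for i in tips:
        above_i = set(path_to_root(i, parent))
        for j in tips:
            shared = above_i.intersection(path_to_root(j, parent))
            cov[(i, j)] = sum(parent[n][1] for n in shared)
    return cov


# Hypothetical toy tree: ((p0:1, p1:1)a:2, p2:3)root
parent = {"p0": ("a", 1.0), "p1": ("a", 1.0), "a": ("root", 2.0), "p2": ("root", 3.0)}
cov = bm_covariance(["p0", "p1", "p2"], parent)
```

Here p0 and p1 share only the branch above their common ancestor, while p2 shares nothing with either, so its covariance with them is zero. An admixture edge into p2 would break exactly that zero, which is the kind of signal a strict tree cannot absorb.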

One approach I'm exploring is to plot the majority rule consensus tree for the underlying bifurcating divergence tree. Then conditioning on that tree, plot the admixture edges with posterior probability greater than, say, 0.5 given that the source and destination branches exist. This gives you a conditioned majority rule consensus DAG of sorts.
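The summary described above can be sketched in a few lines of Python. This is only an illustration under my own assumptions: each posterior draw is represented as a set of clades (frozensets of population names) plus a set of admixture edges between clades, and the function name and cutoffs are hypothetical. The actual figures were produced by R scripts parsing RevBayes output, which this does not reproduce.

```python
from collections import Counter


def conditioned_consensus(samples, tree_cut=0.5, edge_cut=0.5):
    """Summarize posterior draws as a conditioned majority-rule consensus DAG.

    samples -- list of (clades, admix_edges) pairs, one per posterior draw
               clades      : set of frozensets of population names
               admix_edges : set of (source_clade, dest_clade) pairs
    """
    n = len(samples)
    clade_counts = Counter(c for clades, _ in samples for c in clades)
    # Majority rule: keep clades found in more than tree_cut of the draws.
    consensus = {c: k / n for c, k in clade_counts.items() if k / n > tree_cut}

    candidate_edges = {e for _, edges in samples for e in edges}
    hits = Counter()    # draws containing the edge
    trials = Counter()  # draws in which both endpoint branches exist
    for clades, edges in samples:
        for src, dst in candidate_edges:
            if src in consensus and dst in consensus and src in clades and dst in clades:
                trials[(src, dst)] += 1
                if (src, dst) in edges:
                    hits[(src, dst)] += 1

    kept = {}
    for edge, count in trials.items():
        prob = hits[edge] / count  # probability conditional on both branches existing
        if prob > edge_cut:
            kept[edge] = prob
    return consensus, kept


# Four toy draws over populations A-D (made-up data).
A, B, C, D = (frozenset(x) for x in "ABCD")
AB, CD, AC = frozenset("AB"), frozenset("CD"), frozenset("AC")
samples = [
    ({A, B, C, D, AB, CD}, {(AB, C)}),
    ({A, B, C, D, AB, CD}, {(AB, C)}),
    ({A, B, C, D, AB, CD}, set()),
    ({A, B, C, D, AC}, set()),
]
consensus, kept = conditioned_consensus(samples)
```

In this toy case the edge AB->C appears in only two of four draws, so it would miss an unconditional 0.5 cutoff, but it appears in two of the three draws where branch AB exists and is therefore kept. That difference is the point of conditioning on the source and destination branches.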

Now, I've been interested in how well this method works for only a single diploid sample per population, so I simulated data for 20 populations, with two samples per population and 100,000 SNPs per sample. You can see the resulting conditioned majority rule consensus DAG below, generated by some R scripts that parse RevBayes' MCMC output.

The tree height is one, but scaled by a parameter that captures information about the mean population size, time to the most recent common ancestor, and generation time (not shown). Branch lengths are proportional to time and widths are informative of population size relative to the population size mean (log scale). All divergence events were supported with posterior probability 1.0, so those values are not shown. Although time and population size aren't identifiable, they are useful to separate since I model admixture events to occur instantaneously in time. Admixture edges report their posterior probability and mean posterior admixture weight.

What's important to note is that the analysis records the admixture edges p4->p5 and p3->(p7,p8) as having high posterior probability. There's some uncertainty in the exact placement of edge p3->(p7,p8). The order of admixture events is reversed, partly owing to the non-identifiability of age from population size. Finally, since population size and time aren't identifiable and are inversely related, we see the model redistributes these parameter values fairly evenly (notably, the sister lineage to p0 is lengthened, possibly due to the birth-death prior on divergence times).

True graph:

Inferred graph:

Some updates

posted Jul 2, 2013, 1:29 PM by Michael Landis

Evolution 2013 was very enjoyable, though exhausting (my classmates and I carpooled and camped). To share some of my "paperless" research, such as my Evolution talk, I created a Figshare account.

I received some great feedback about BayArea at the conference, both from empiricists and theorists. Designing new models is the next part, which is the driving force behind increasing the number of areas per analysis. As an aside, I updated BayArea to fix a problem with proposing histories for small numbers of areas (N < 10). Thanks to Julien Vieu for mentioning the problem.

Also, I updated creepy-jerk (continuous character evolution using Lévy processes) to improve performance for computing the compound Poisson process with normally distributed jumps. While doing this, I modified the code to produce a FigTree-compatible output file that indicates the size and polarity of jumps on the phylogeny (using the posterior of sampled jumps and the signal-to-noise ratio divided by the square root of the branch length).
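For readers unfamiliar with the process, here is a small Python sketch of a compound Poisson process with normally distributed jumps acting along a single branch. It only simulates the process; it is not the creepy-jerk code, and the function name, parameter names, and parameter values are assumptions chosen for illustration.

```python
import math
import random


def cpp_increment(t, jump_rate, jump_sd, rng=random):
    """Draw the trait change along a branch of length t.

    Jumps arrive as a Poisson process with rate jump_rate, and each jump is
    Normal(0, jump_sd**2). Returns (total change, list of individual jumps).
    """
    jumps = []
    if jump_rate <= 0:
        return 0.0, jumps
    elapsed = rng.expovariate(jump_rate)  # waiting time to the first jump
    while elapsed < t:
        jumps.append(rng.gauss(0.0, jump_sd))
        elapsed += rng.expovariate(jump_rate)
    return math.fsum(jumps), jumps


# Sanity check: the variance of the change should be near jump_rate * t * jump_sd**2.
rng = random.Random(1)
draws = [cpp_increment(1.5, 2.0, 0.5, rng)[0] for _ in range(20000)]
mean = sum(draws) / len(draws)
var = sum((x - mean) ** 2 for x in draws) / len(draws)
```

With these values the theoretical variance is 2.0 * 1.5 * 0.25 = 0.75. Keeping the individual jumps, rather than only their sum, is what allows their size and sign to be reported per branch.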

Looks nice, I think!

Historical biogeography and statistical inference

posted May 19, 2013, 9:42 PM by Michael Landis   [ updated May 20, 2013, 10:35 AM ]

Biogeography is the study of the distribution of life throughout time and space. Here we'll describe space in terms of species ranges, the shape and area of the geography commonly inhabited by those species. Species ranges rarely remain constant over time, but instead contract, expand, and divide in response to environmental, ecological, and geographical events. We'll focus on historical biogeography, which is chiefly interested in these processes operating over geological timescales. This marriage to geological time is not entirely amicable, because while treating species ranges as functions of time allows us to learn what processes generated biogeographic patterns, time simultaneously complicates the observation of ancestral species ranges we wish to describe (e.g. through taphonomic bias, tectonic drift, etc).

Let's pessimistically assume there's no hope of procuring direct evidence of an ancestral species range (e.g. the complete global fossil record spanning the past four billion years), and instead depend only on data we can reasonably expect to produce for a group of species: the extant species ranges and their shared phylogenetic tree. With a phylogeny in hand and treating the species ranges as random variables, we can model range evolution as a stochastic process and leverage decades of theoretical and computational approaches developed for phylogenetic inference.

Now we can dig into some science equipped with statistical tools. From the likelihood of the extant species ranges, we can assign confidence measures to ancestral species range reconstructions, infer the most likely interval of range evolution parameters (e.g. rate, distance effects), and rule out implausible modes of range evolution via model testing. Of course, biogeography's foibles have introduced their own set of challenges to statistical phylogenetics, which are of primary interest in the posts to come.

I've glossed over many details, hopefully inciting a touch of curiosity or chagrin. In the next post, we'll cover some basics of phylogenetic inference.


Additional reading:

Goldberg, E. E., Lancaster, L. T., & Ree, R. H. (2011). Phylogenetic inference of reciprocal effects between geographic range evolution and diversification. Systematic Biology, 60(4), 451-465.

Lemey, P., Rambaut, A., Drummond, A. J., & Suchard, M. A. (2009). Bayesian phylogeography finds its roots. PLoS Computational Biology, 5(9), e1000520.

Lemmon, A. R., & Lemmon, E. M. (2008). A likelihood framework for estimating phylogeographic history on a continuous landscape. Systematic Biology, 57(4), 544-561.

Ree, R. H., Moore, B. R., Webb, C. O., & Donoghue, M. J. (2005). A likelihood framework for inferring the evolution of geographic range on phylogenetic trees. Evolution, 59(11), 2299-2311.

Ree, R. H., & Sanmartín, I. (2009). Prospects and challenges for parametric models in historical biogeographical inference. Journal of Biogeography, 36(7), 1211-1220.

Ree, R. H., & Smith, S. A. (2008). Maximum likelihood inference of geographic range evolution by dispersal, local extinction, and cladogenesis. Systematic Biology, 57(1), 4-14.

Ronquist, F., & Sanmartín, I. (2011). Phylogenetic methods in biogeography. Annual Review of Ecology, Evolution, and Systematics, 42, 441-464.

Sanmartín, I., Van Der Mark, P., & Ronquist, F. (2008). Inferring dispersal: a Bayesian approach to phylogeny-based island biogeography, with special reference to the Canary Islands. Journal of Biogeography, 35(3), 428-449.

BayArea v1.0 release

posted May 9, 2013, 10:48 AM by Michael Landis   [ updated May 20, 2013, 10:35 AM ]

Exciting news! I just uploaded the first version of BayArea, a Bayesian method to infer ancestral species ranges using a molecular phylogeny and presence-absence data. The main focus of the method is to accommodate a very large number of discrete areas, constituting a geography over which species' ranges span and may change with time through the gain and loss of area occupancy. BayArea allows the inclusion of hundreds to thousands of areas per analysis, a substantial improvement from the previous limit of ten to twenty areas.
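To give a feel for what "gain and loss of area occupancy" means, here is a deliberately simplified Python sketch that evolves a presence-absence range along one branch. It assumes every area is gained and lost independently at constant rates and forbids emptying the range; those choices, the function name, and the rate values are my own assumptions for illustration and are not BayArea's actual model or code.

```python
import random


def simulate_range(start, t, gain_rate, loss_rate, rng=random):
    """Evolve a presence-absence range along a branch of length t.

    start is a tuple of 0/1 flags, one per area. Each absent area is gained at
    gain_rate and each occupied area is lost at loss_rate. A loss that would
    empty the range is disallowed (an assumption of this sketch).
    """
    state = list(start)
    clock = 0.0
    while True:
        occupied = [i for i, s in enumerate(state) if s == 1]
        absent = [i for i, s in enumerate(state) if s == 0]
        can_lose = occupied if len(occupied) > 1 else []
        total = gain_rate * len(absent) + loss_rate * len(can_lose)
        if total <= 0:
            break  # no event can happen
        clock += rng.expovariate(total)  # waiting time to the next event
        if clock >= t:
            break  # the branch ends before the next event
        if rng.random() < gain_rate * len(absent) / total:
            state[rng.choice(absent)] = 1
        else:
            state[rng.choice(can_lose)] = 0
    return tuple(state)


# A species starting in the first of five areas (made-up rates).
rng = random.Random(0)
end_range = simulate_range((1, 0, 0, 0, 0), 2.0, 0.3, 0.3, rng)
```

Note that N areas allow 2^N presence-absence configurations, so 20 areas already give over a million possible ranges. Simulating events one at a time, as above, never has to enumerate that full set.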

The source code, manual, and an example dataset are available at http://bayarea.googlecode.com. You should expect to see a Systematic Biology paper exposing the technical details soon (written with John Huelsenbeck, Nick Matzke, and Brian Moore). In a series of upcoming posts, I'll give a rundown of the motivation for the method, the technical challenges faced in implementing the software, and what new and interesting biological questions BayArea may help answer.
