Post date: Apr 08, 2014 4:50:6 PM
I have been toying with alternative methods for defining linkage groups for Timema, and I have identified one reasonable alternative that we should think about. There are pros and cons with this and our earlier approach. First, here are summaries of the old (1) and new (2) approaches; both approaches start by estimating pairwise recombination rates between all pairs of scaffolds where we have recombination informative markers:
(1) In our previous approach I developed an quick method analogous to an approach used in population genetic clustering. I begin by extracting the top PC's from the matrix of pairwise recombination rates. Scaffolds with similar PC scores should have show similar patterns of recombination (i.e., high and low recombination with the same sets of markers). I then use K-means clustering with K equal to the number of LGs we expect (actually K is a bit higher then the number of LGs because I throw some out, see older posts for details) to identify what should be the centroids of recombination rate PC space for each LG. Finally, I use discriminant function analysis to assign scaffolds to each of the LGs (scaffolds that cannot be confidently assigned are discarded). One major pro and con of this approach is that we are essentially guaranteed to find 13 LGs (or whatever number we want, similar to setting K to 13 in structure). In other words there isn't an absolute recombination rate required for two scaffolds to be assigned to the same vs. different LGs, but rather we find the 'best' LGs given the assumption that there are 13 of them. This is non-standard, and very much a heuristic approach, but I think it makes pretty good use of our data and will give a reasonable answer.
(2) In the new approach I use a hierarchical clustering algorithm that is more akin to standard practices. Here scaffolds are basically joined in a tree, but unlike trees used in e.g. phylogenetics the distance between a scaffold and a cluster (clade) of scaffolds is the minimum distance between that scaffold and all other scaffolds in the cluster (not the mean distance). In an ideal world we would end up with a tree that has 13 deep divisions (one per LG) with many shorter branches for the scaffolds within each LG. We would even have an expectation for the total length of each major clade (basically one Morgan). Unfortunately our result is not quite this clean (see projects/timema_lgs/v2/lgCluster.pdf). There are nice deep splits consistent with expected LGs, but we have about twice as many (depending on where you draw the lines for same vs. different LG) LGs as we should. And the total tree length is too high, because our recombination rates are still upwardly biased by finite coverage. So, we could easily use this approach to identify scaffolds that are on the same LG (and probably with more confidence than method 1), but it would be trick to collapse this all the way down to 13 LGs.
I used the R function agnes for this analysis, but otherwise the code comes from projects/timema_lgs/v2/makelg.R
library(cluster)
out<-agnes(dmatmin,method="single",stand=FALSE)