Post date: Sep 10, 2013 4:35:44 PM
I generated provisional linkage groups from the Timema recombination rates using pca, k-means clustering, and lda. I somewhat arbitrarily am using the first 26 pc's (there should be 26 linkage groups). I then examined the relationship between the number of clusters and the proportion of variation partitioned among clusters. This increases rapidly then slowly asymptotes. Again somewhat arbitrarily I chose k = 30 for creating linkage groups (previously I noted that some clusters appear to be leftover junk, so I wanted k > expected number of linkage groups). I than ran lda. I only kept scaffolds with high assignment probability to a cluster (> 0.95) and clusters with low (< 0.3) mean recombination rate. This gave me 23 clusters. I then ordered scaffolds within clusters using Qiurong's implementation of the unidirectional growth algorithm (in 'ug.R'), and saved the results in timMap (a list in the lg.rwsp). The results are ok. Most neighboring scaffolds have low recombination rates, but this is not true of all pairs. Also, there are not enough really low recombination rates (the sum of recombination rates along scaffolds is too high (e.g., for list scaffolds 15 it is 11.85, but it should be 1). Thus, I think the scaffolds are over-assembled, and that genotype uncertainty is still affecting the results, but that most of the scaffolds in the same provisional linkage group are probably on the same chromosome (or maybe on a few chromosomes). This is useful, but we need more sequencing. Here is my R code:
load("lg.rwsp")
outpca<-prcomp(dmatf2[long,long],center=T)
ss<-numeric(50)
ss[1]<-0
for(i in 2:50){
out<-kmeans(x=outpca$x[,1:26],i,iter.max=1000,nstart=100,algorithm="Hartigan-Wong")
ss[i]<-out$betweenss/out$totss
}
outk<-kmeans(x=outpca$x[,1:26],30,iter.max=1000,nstart=100,algorithm="Hartigan-Wong")
outlda<-lda(x=outpca$x[,1:26],grouping=outk$cluster,CV=TRUE)
lg<-as.numeric(as.character(outlda$class))
maxlg<-apply(outlda$posterior,1,max)
lg[maxlg<0.95]<-NA
mns<-numeric(30)
nsc<-numeric(30)
for(i in 1:30){
mns[i]<-mean(dmatf2[long[lg==i],long[lg==i]],na.rm=TRUE)
nsc[i]<-sum(lg==i,na.rm=TRUE)
}
sum(mns < 0.3) ## 23 clusters have a mean recombination rate less than 0.3, these are my lgs for now
goodLgs<-which(mns < 0.3)
## get names (numbers) for long scaffolds
lslist<-slist[long]
timMap<-vector(mode='list',length(lslist))
j<-1
for(i in goodLgs){
lglslist<-lslist[which(lg==i)] ## these are the scaffold numbers on this linkage group
out<-UGlinkage(dmatf2[long[which(lg==i)],long[which(lg==i)]])
timMap[[j]]<-list(lglslist[out$order],out$neighbor.dist)
j<-j+1
}