Post date: Sep 10, 2013 6:54:54 PM
I made a few changes to the analysis for identifying linkage groups. Specifically, I now only use the first 13 pc's and I use K = 16 (these both correspond to some-what abrupt changes in pve). This generates 13 good linkage groups, which is the correct number of chromosomes. I wrote the results to individual files names lg1.txt to lg13.txt. These files are in projects/timema_lgs/, the first column in each gives the scaffold numbers (in order) and the second gives the recombination distance between pairs of scaffolds. The linkage groups and ordering are certainly far from perfect, but I suspect that they are good enough to be rather useful and I will use them for now. I need to integrate this information with the draft genome. Here is the new R code:
outpca<-prcomp(dmatf2[long,long],center=T)
npca<-13
ss<-numeric(50)
ss[1]<-0
for(i in 2:50){
out<-kmeans(x=outpca$x[,1:npca],i,iter.max=1000,nstart=100,algorithm="Hartigan-Wong")
ss[i]<-out$betweenss/out$totss
}
K<-16
outk<-kmeans(x=outpca$x[,1:npca],K,iter.max=1000,nstart=100,algorithm="Hartigan-Wong")
outlda<-lda(x=outpca$x[,1:npca],grouping=outk$cluster,CV=TRUE)
lg<-as.numeric(as.character(outlda$class))
maxlg<-apply(outlda$posterior,1,max)
lg[maxlg<0.95]<-NA
mns<-numeric(K)
nsc<-numeric(K)
for(i in 1:K){
mns[i]<-mean(dmatf2[long[lg==i],long[lg==i]],na.rm=TRUE)
nsc[i]<-sum(lg==i,na.rm=TRUE)
}
nlg<-sum(mns < 0.3) ## 13 clusters have a mean recombination rate less than 0.3, these are my lgs for now
goodLgs<-which(mns < 0.3)
## get names (numbers) for long scaffolds
lslist<-slist[long]
timMap<-vector(mode='list',length(lslist))
j<-1
for(i in goodLgs){
lglslist<-lslist[which(lg==i)] ## these are the scaffold numbers on this linkage group
out<-UGlinkage(dmatf2[long[which(lg==i)],long[which(lg==i)]])
timMap[[j]]<-list(lglslist[out$order],out$neighbor.dist)
j<-j+1
}
## write results
for(i in 1:nlg){
a<-paste("lg",i,".txt",sep="")
cat(length(timMap[[i]][[1]])," ",length(timMap[[i]][[2]])," ",i,"\n")
write.table(x=cbind(timMap[[i]][[1]],round(timMap[[i]][[2]],4)),file=a,row.names=FALSE,col.names=FALSE,quote=FALSE)
}