Post date: Sep 06, 2013 8:12:4 PM
My basic plan is to (a) cluster genome scaffolds by mean recombination rate, and (b) try to construct linkage groups with the unidirectional growth algorithm. We will likely need more or better data before we can order scaffolds within linkage groups (distance within a scaffold and recombination are only weakly correlated, see below), but we might be able to find linkage groups reasonably well.
First I grabbed the first seven columns from each recomb file, these contain: scaf. locus a, pos. locus a, scaf. locus b, pos. locus b, number of individuals with high quality (90% certainty) genotypes, p-vaule for the null model of Mendelian segregation, and the recombination rate estimate,
cut -f 1-7 -d " " recomb0.9_p0_offspringGenotypes.txt > est_recomb0.9_p0_offspringGenotypes.txt
cut -f 1-7 -d " " recomb0.9_p1_offspringGenotypes.txt > est_recomb0.9_p1_offspringGenotypes.txt
Next I concatenated all of the recombination estimate files and wrote them to a single file, currently in scratch, recomb_estimates_all.txt. I then analyzed the recombination rates in R (the commands, files, and workspace for this analysis are all in projects/timema_lg). My main findings are the recombination rates are now much more reasonable for variants on the same scaffolds (but perhaps still a little high, I don't think the low data quality and genotype uncertainty issues have been fully removed). See the recombination rate quantile plot for details. But there is still only a very weak relationship between physical distance and recombination rate on scaffolds (see the scatter plot). This is probably because of genotype uncertainty, but it could also be real and reflect substantial variation in the recombination rate. Either way this means that we cannot convert recombination rates between scaffolds to physical distances.
I calculated the mean recombination rate between each pair of ~3000 scaffolds (missing data was set to 0.5), and used this to visualize and hierarchically cluster the scaffolds into putative linkage groups. This worked reasonably well (see results for 100 or 500 scaffolds, full results are on the computer), but there is also some complexity and I will need to try a formal linkage map generation method to see whether I can correctly recover 26 linkage groups. I am going to try Quirong's R implementation of the unidirectional growth algorithm once Alex has a few issues worked out. I could also try other packages, but they do not seem to work directly from a recombination rate matrix. In the end we will probably need even more sequence data, but we can at least get some approximation of linkage groups for now.
Here is the R code I used,
## read data = recombinationr rates and quality
recomb<-matrix(scan("~/labs/evolution/scratch/recomb_estimates_all.txt",n=7*79017077,sep=" "),nrow=79017077,ncol=7,byrow=T)
## identify and select the high quality subset of the data, Mendealian two-tailed prob. of 0.01 (0.005 on each side), and 10 indviduals
hqr<-which(recomb[,6] > 0.005 & recomb[,5] >= 10)
hqdata<-recomb[hqr,]
## orientate recombination fractions by folding
hqdata[hqdata[,7] > 0.5,7]<-1-hqdata[hqdata[,7] > 0.5,7]
## quantile plots for recomb rate, same vs. different scaffolds
ss<-which(hqdata[,1]==hqdata[,3]) ## 13,058 SNV pairs
ds<-which(hqdata[,1]!=hqdata[,3]) ## 16,706,273 SNV pairs
nss<-length(ss)
nds<-length(ds)
pdf("recombQuantilePlot.pdf",width=6,height=6)
plot(seq(1,nds,1)/nds,sort(hqdata[ds,7]),type='l',lwd=2,cex.lab=1.4,cex.axis=1.2,xlab="quantile",ylab="recombination fraction")
lines(seq(1,nss,1)/nss,sort(hqdata[ss,7]),lwd=2,col="orange")
legend(0.05,0.5,legend=c("same scaf.","diff. scaf."),col=c("orange","black"),lwd=2)
dev.off()
## plot distance vs. recomb. within scaffolds
pdf("recombDistPlot.pdf",width=6,height=6)
plot(hqdata[ss,4]-hqdata[ss,2],hqdata[ss,7],cex.lab=1.4,cex.axis=1.2,xlab="physical distance",ylab="recombinatrion fraction")
text(2500000,0.48,"r = 0.025\np = 0.004")
dev.off()
cor.test(hqdata[ss,4]-hqdata[ss,2],hqdata[ss,7])
## r = 0.025, p = 0.004
## mean recomb. for scaffolds, create pairwise matrix
slist<-unique(c(unique(hqdata[,1])),unique(hqdata[,3]))
l<-length(slist)
dmat<-matrix(0,nrow=l,ncol=l)
nmat<-matrix(0,nrow=l,ncol=l)
N<-dim(hqdata)[1]
for(n in 1:N){
cat(n,"\n")
a<-which(slist==hqdata[n,1])
b<-which(slist==hqdata[n,3])
nmat[a,b]<-nmat[a,b] + 1
dmat[a,b]<-dmat[a,b] + hqdata[n,7]
}
dmatf<-dmat/nmat
for(i in 1:l){
for(j in i:l){
dmatf[j,i]<-dmatf[i,j]
}
}
dmatf2<-dmatf ## replace nan with 0.5, this is conservative
dmatf2[is.nan(dmatf)]<-0.5
## heatmap plot
nsc<-100
pdf("lgheatmap100.pdf",width=20,height=20)
heatmap.2(x=dmatf2[1:nsc,1:nsc],distfun=as.dist,symm=TRUE,col="terrain.colors",trace="none",Colv=TRUE,Rowv=TRUE,scale="none",na.rm=TRUE,dendrogram="both")
dev.off()